[DRBD-user] b->n_req != set_size

Mon Mar 10 11:55:14 CET 2008

On Sun, Mar 09, 2008 at 05:26:42PM -0700, Tom Brown wrote:
> 
> On Sun, 2008-03-09 at 17:06 -0700, Tom Brown wrote:
> > On Sun, 2008-03-09 at 22:58 +0100, Lars Ellenberg wrote:
> > > On Sun, Mar 09, 2008 at 10:34:53AM -0700, Tom Brown wrote:
> > > > On Sun, 2008-03-09 at 14:35 +0100, Lars Ellenberg wrote:
> > > > > On Sat, Mar 08, 2008 at 02:27:04PM -0800, Tom Brown wrote:
> > > > > > Hardware: 2 Dell PowerEdge SC1435's
> > > > > > Mirrored Drive: 1TB Hitachi HUA72101 SATA
> > > > > > OS: Debian Etch 4.0r3
> > > > > > Kernel: vanilla kernel 2.6.24.3
> > > > > > DRBD: 8.2.5
> > > > > > Heartbeat: 2.1.3
> > > > > > 
> > > > > > I had drbd and heartbeat up and running. I did initial tests of the
> > > > > > fail-over and mirroring. Everything worked as expected. Then I attached
> > > > > > an external drive via firewire to a SIIG firewire card in the primary
> > > > > > node. I mounted the external drive on /backup. The /dev/drbd0 device is
> > > > > > mounted on /ha. Then I issued the following command at 17:50 and left
> > > > > > for the night:
> > > > > > 
> > > > > > tar cf /ha/fullbackup.tar /backup/ha
> > > > > > 
> > > > > > The /backup/ha directory contains 334GB of data. When I came in to work
> > > > > > this moring, I issued an 'ls -lh /ha' command and it hung. I checked
> > > > > > syslog and found this:
> > > > > > 
> > > > > > Mar  7 20:03:02 fs01 kernel: drbd0: FIXME (barrier_acked but pending)
> > > > > > f6af0688 W L-coNp-s-- 82821 (621446208s +4096) Connected
> > > > > 
> > > > > a write request, belongin into epoch 82821,
> > > > >  4kB in size, to sector 621446208,
> > > > > locally completed ok,
> > > > > actually handed over to the tcp stack,
> > > > > but still waiting on the "Ack" from the remote node.
> > > > > 
> > > > > the barrier ack which closes this epoch
> > > > > came in before the Ack, which also explains
> > > > > why the barrier ack'ed set_size is one less than expected.
> > > > > 
> > > > > This "should not happen".
> > > > > it still does not explain why it hangs after that...
> > > > > 
> > > > > > Mar  7 20:03:02 fs01 kernel: drbd0: ASSERT( b->n_req == set_size )
> > > > > > in /usr/src/drbd-8.2.5/drbd/drbd_main.c:238
> > > > > > Mar  7 20:03:02 fs01 kernel: drbd0: b->n_req = 592
> > > > > > in /usr/src/drbd-8.2.5/drbd/drbd_main.c:246
> > > > > > Mar  7 20:03:02 fs01 kernel: drbd0: set_size = 591
> > > > > > in /usr/src/drbd-8.2.5/drbd/drbd_main.c:247
> > > > > >
> > > > > > Any access to /ha hangs. The tar command is hung. I found a post from
> > > > > > January 10, 2005 that the second line in the log is nothing to worry
> > > > > > about. I looked in drbd_main.c and didn't see anything that indicated a
> > > > > > major problem. It looks like it just reports the sizes of b->n_req and
> > > > > > set_size when they are not equal.
> > > > > > 
> > > > > > What does it mean when b->n_req != set_size? Is this an indicator of why
> > > > > > the drbd0 device is not accessible anymore? Or is the first line from
> > > > > > the log above (where it says FIXME) an indicator of a bigger problem? 
> > > > > > 
> > > > > > I had to restart the primary. I could access the drbd0 device after it
> > > > > > came back up. I found that the last write to the tar file was at 20:02.
> > > > > > That's about the same time those errors showed up in the log.
> > > > > > 
> > > > > > Well I found some posts about a Broadcom NetXtreme II BCM5708 NIC with
> > > > > > TOE causing drbd lockups. I am using an onboard Broadcom NetXtreme
> > > > > > BCM5721 NIC without TOE. One of the posts said to try this:
> > > > > > 
> > > > > > ethtool -K ethX tx off
> > > > > > ethtool -K ethX rx off
> > > > > > 
> > > > > > Which I did and tried it again. This time it worked. Did I fix the
> > > > > > problem, or just get lucky? Any ideas?
> > > > > 
> > > > > There may be some race within DRBD.
> > > > > There may just be data corruption (bit flips) on the wire.
> > > > > it is also possible that there is some incompatibility between
> > > > > DRBD's expectations for some things and how it is actualy done
> > > > > in kernel 2.6.24.
> > > > > 
> > > > > Can you give me a "cat /proc/drbd" please?
> > > > > (preferable from the time when it "hung".)
> > > > > 
> > > > > well, you could also just tell me the git-hash of the DRBD "8.2.5"
> > > > > and the drbd protocol you have been using, and I'll try to "imagine"
> > > > > possible problems while meditating over pieces of source code...
> > > > 
> > > > Here's the 'cat /proc/drbd' while it was hung:
> > > > 
> > > > version: 8.2.5 (api:88/proto:86-88)
> > > > GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by tbrown at fs01,
> > > > 2008-03-06 08:12:19
> > > >  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
> > > >     ns:318254844 nr:160 dw:318255004 dr:9762 al:80148 bm:1 lo:0 pe:1
> > > > ua:0 ap:1
> > > >         resync: used:0/31 hits:5 misses:1 starving:0 dirty:0 changed:1
> > > >         act_log: used:1/257 hits:79483563 misses:83224 starving:0
> > > > dirty:3076 changed:80148
> > > 
> > > great.
> > > so we have ap:1 one application request still on the fly,
> > > which happens to be pe:1 pending (waiting to be ACKed by the other node)
> > > 
> > > probably a file system journal commit, so the file system is waiting for
> > > this one to complete eventually, which never happens. thus "hung".
> > > 
> > > it never got ACKed, which fits in with that strange one-off mismatch
> > > between b->n_req and set_size in the kernel messages above.
> > > this makes me speculate a lot of things...
> > > which I don't want to bother anyone with right now, because maybe you
> > > have some more non-speculative facts?
> > > 
> > > /proc/drbd on the secondary at that time?
> > > kernel messages/syslog on the secondary during that event?
> > > anything more than the above four log lines on the Primary?
> > 
> > I don't have anthing other that what I posted. I should know better than
> > that. Please speculate.
> 
> Wait. I did find this in the log of the secondary:
> 
> Mar  7 20:03:02 fs02 kernel: drbd0: skipping unknown optional packet
> type 8192, l: 4120!
> 
> This is the same time as the FIXME log above, on the primary. That's the
> time the hang occurred.

THIS is the one.
Thank you. Now I can spare you my speculations :)

each and every drbd packet on the wire has a header, which is
u32       magic;
u16       command; /* reported as "type" above */
u16       length;  /* reported as "l: ", bytes after this header */

a Data packet would have "type" == 0,
you got type "8192", which is a bitflip at position 13.

a Data packet then has additionally 24 bytes info,
u64         sector;    /* 64 bits sector number */
u64         block_id;  /* to identify the request in protocol B&C */
u32         seq_num;
u32         dp_flags;
followed by
[optionally checksum hash]
[actuall data payload]

what happened here is that something on the "wire",
presumably your NIC or bus, corrupted data.  which resulted in a "Data"
command being not recognized.  there currently is no "optional packet"
in drbd, only the infrastructure in case we would eventually invent one.

so the payload (4120 byte, which happens to be exactly 4k + 24 byte)
was just discarded.
data never received, epoch size too small, write never acknowledged.

so switching off the checksum offloading has helped,
if we assume the bitflip happened on transfer to/from the NIC.
when the checksum is calculated/verified by the kernel when data is
still/already in main memory, those can be cached and trigger a resent.

in drbd, you may want to switch on "data integrity alg",
which will calculate a digest hash over the actual data payload,
and insert that into the data packet, as indicated above.

it is likely that there would have been a bitflip in the data chunk as
well, which would have been caught.  a bitflip in the header/packet info
only would not be caught, though!  (we may want to add a crc there, too)

Note:
because you may have had any number of undetected bit-flips in the data
portion of replicated writes, to recover from this, I strongly recommend
to switch on data-integrity alg, and do an online-verify.

and, let me again say "thank you".
this was a very good example from the book about
 how to do bug reports right.
(thats we I did not trimm the quotes at all,
 just to keep the good example intact).

catchy subject right on the point,
precise (but not too much) information
about the setup, as well as the problem,
pre-parsed prepared log extracts, again to the point,
reports about own investigation,
a proposed workaround, own mental input.

really, this thread cheered me up a bit,
given the growing number of not-so-interessting threads here recently.

cheers,

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please use the "List-Reply" function of your email client.