[DRBD-user] b->n_req != set_size

Tom Brown brown at esteem.com
Mon Mar 10 01:26:42 CET 2008

On Sun, 2008-03-09 at 17:06 -0700, Tom Brown wrote:
> On Sun, 2008-03-09 at 22:58 +0100, Lars Ellenberg wrote:
> > On Sun, Mar 09, 2008 at 10:34:53AM -0700, Tom Brown wrote:
> > > On Sun, 2008-03-09 at 14:35 +0100, Lars Ellenberg wrote:
> > > > On Sat, Mar 08, 2008 at 02:27:04PM -0800, Tom Brown wrote:
> > > > > Hardware: 2 Dell PowerEdge SC1435's
> > > > > Mirrored Drive: 1TB Hitachi HUA72101 SATA
> > > > > OS: Debian Etch 4.0r3
> > > > > Kernel: vanilla kernel 2.6.24.3
> > > > > DRBD: 8.2.5
> > > > > Heartbeat: 2.1.3
> > > > > 
> > > > > I had drbd and heartbeat up and running. I did initial tests of the
> > > > > fail-over and mirroring. Everything worked as expected. Then I attached
> > > > > an external drive via firewire to a SIIG firewire card in the primary
> > > > > node. I mounted the external drive on /backup. The /dev/drbd0 device is
> > > > > mounted on /ha. Then I issued the following command at 17:50 and left
> > > > > for the night:
> > > > > 
> > > > > tar cf /ha/fullbackup.tar /backup/ha
> > > > > 
> > > > > The /backup/ha directory contains 334GB of data. When I came in to work
> > > > > this morning, I issued an 'ls -lh /ha' command and it hung. I checked
> > > > > syslog and found this:
> > > > > 
> > > > > Mar  7 20:03:02 fs01 kernel: drbd0: FIXME (barrier_acked but pending)
> > > > > f6af0688 W L-coNp-s-- 82821 (621446208s +4096) Connected
> > > > 
> > > > a write request, belonging to epoch 82821,
> > > >  4kB in size, to sector 621446208,
> > > > locally completed ok,
> > > > actually handed over to the tcp stack,
> > > > but still waiting on the "Ack" from the remote node.
> > > > 
> > > > the barrier ack which closes this epoch
> > > > came in before the Ack, which also explains
> > > > why the barrier ack'ed set_size is one less than expected.
> > > > 
> > > > This "should not happen".
> > > > it still does not explain why it hangs after that...
> > > > 
> > > > > Mar  7 20:03:02 fs01 kernel: drbd0: ASSERT( b->n_req == set_size )
> > > > > in /usr/src/drbd-8.2.5/drbd/drbd_main.c:238
> > > > > Mar  7 20:03:02 fs01 kernel: drbd0: b->n_req = 592
> > > > > in /usr/src/drbd-8.2.5/drbd/drbd_main.c:246
> > > > > Mar  7 20:03:02 fs01 kernel: drbd0: set_size = 591
> > > > > in /usr/src/drbd-8.2.5/drbd/drbd_main.c:247
> > > > >
> > > > > Any access to /ha hangs. The tar command is hung. I found a post from
> > > > > January 10, 2005 that the second line in the log is nothing to worry
> > > > > about. I looked in drbd_main.c and didn't see anything that indicated a
> > > > > major problem. It looks like it just reports the sizes of b->n_req and
> > > > > set_size when they are not equal.
> > > > > 
> > > > > What does it mean when b->n_req != set_size? Is this an indicator of why
> > > > > the drbd0 device is not accessible anymore? Or is the first line from
> > > > > the log above (where it says FIXME) an indicator of a bigger problem? 
> > > > > 
> > > > > I had to restart the primary. I could access the drbd0 device after it
> > > > > came back up. I found that the last write to the tar file was at 20:02.
> > > > > That's about the same time those errors showed up in the log.
> > > > > 
> > > > > Well I found some posts about a Broadcom NetXtreme II BCM5708 NIC with
> > > > > TOE causing drbd lockups. I am using an onboard Broadcom NetXtreme
> > > > > BCM5721 NIC without TOE. One of the posts said to try this:
> > > > > 
> > > > > ethtool -K ethX tx off
> > > > > ethtool -K ethX rx off
> > > > > 
> > > > > Which I did and tried it again. This time it worked. Did I fix the
> > > > > problem, or just get lucky? Any ideas?
> > > > 
> > > > There may be some race within DRBD.
> > > > There may just be data corruption (bit flips) on the wire.
> > > > it is also possible that there is some incompatibility between
> > > > DRBD's expectations for some things and how it is actually done
> > > > in kernel 2.6.24.
> > > > 
> > > > Can you give me a "cat /proc/drbd" please?
> > > > (preferably from the time when it "hung".)
> > > > 
> > > > well, you could also just tell me the git-hash of the DRBD "8.2.5"
> > > > and the drbd protocol you have been using, and I'll try to "imagine"
> > > > possible problems while meditating over pieces of source code...
> > > 
> > > Here's the 'cat /proc/drbd' while it was hung:
> > > 
> > > version: 8.2.5 (api:88/proto:86-88)
> > > GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by tbrown at fs01,
> > > 2008-03-06 08:12:19
> > >  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
> > >     ns:318254844 nr:160 dw:318255004 dr:9762 al:80148 bm:1 lo:0 pe:1
> > > ua:0 ap:1
> > >         resync: used:0/31 hits:5 misses:1 starving:0 dirty:0 changed:1
> > >         act_log: used:1/257 hits:79483563 misses:83224 starving:0
> > > dirty:3076 changed:80148
> > 
> > great.
> > so we have ap:1 one application request still on the fly,
> > which happens to be pe:1 pending (waiting to be ACKed by the other node)
> > 
> > probably a file system journal commit, so the file system is waiting for
> > this one to complete eventually, which never happens. thus "hung".
> > 
> > it never got ACKed, which fits in with that strange one-off mismatch
> > between b->n_req and set_size in the kernel messages above.
> > this makes me speculate a lot of things...
> > which I don't want to bother anyone with right now, because maybe you
> > have some more non-speculative facts?
> > 
> > /proc/drbd on the secondary at that time?
> > kernel messages/syslog on the secondary during that event?
> > anything more than the above four log lines on the Primary?
> 
> I don't have anything other than what I posted. I should know better than
> that. Please speculate.

Wait. I did find this in the log of the secondary:

Mar  7 20:03:02 fs02 kernel: drbd0: skipping unknown optional packet
type 8192, l: 4120!

This is the same time as the FIXME log above, on the primary. That's the
time the hang occurred.
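
To line the two log entries up, here is a rough Python sketch. It is scratch code for illustration, not anything taken from DRBD, and the function names are made up. It pulls the epoch, sector and size out of the primary's FIXME line, pulls the type and length out of the secondary's message, and then checks two things: whether the unknown packet type has exactly one bit set, and how much longer the packet length is than the pending 4096-byte write. Reading the remainder as a per-packet header is only an assumption on my part.

```python
import re

# Log text copied from this thread (line wraps joined).
PRIMARY_FIXME = "f6af0688 W L-coNp-s-- 82821 (621446208s +4096) Connected"
SECONDARY_MSG = "drbd0: skipping unknown optional packet type 8192, l: 4120!"


def parse_fixme(line):
    """Return epoch, start sector and size (bytes) from the FIXME request dump."""
    m = re.search(r"(\d+) \((\d+)s \+(\d+)\)", line)
    if m is None:
        raise ValueError("unrecognized FIXME line")
    epoch, sector, size = (int(g) for g in m.groups())
    return {"epoch": epoch, "sector": sector, "size": size}


def parse_skipped_packet(line):
    """Return the packet type and length from the secondary's message."""
    m = re.search(r"packet\s+type (\d+), l: (\d+)", line)
    if m is None:
        raise ValueError("unrecognized 'skipping' line")
    return {"type": int(m.group(1)), "length": int(m.group(2))}


def is_single_bit(value):
    """True if exactly one bit is set in value."""
    return value > 0 and value & (value - 1) == 0


req = parse_fixme(PRIMARY_FIXME)
pkt = parse_skipped_packet(SECONDARY_MSG)

# Assumption: whatever is left after the payload is per-packet overhead.
overhead = pkt["length"] - req["size"]

print("request:", req)
print("packet type:", hex(pkt["type"]), "single bit set:", is_single_bit(pkt["type"]))
print("length minus request size:", overhead)
```

The type value 8192 is 0x2000, a single set bit, and the length is 24 bytes more than the 4096-byte write that never got its Ack. That would at least be consistent with the "bit flips on the wire" possibility, but it does not prove it.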

Thanks,
Tom
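
P.S. In case it helps anyone catch this state earlier, below is a minimal Python sketch that reads the counter line from /proc/drbd and flags the situation described above: pe and ap both non-zero while ns does not move between two snapshots. The sample text is the counter line quoted earlier in this thread; the function names and the "possibly stuck" rule are my own assumptions, not DRBD tooling. A single snapshot cannot distinguish a hung request from one that is merely in flight, which is why the check takes two.

```python
import re

# Counter lines as quoted earlier in this thread (line wrap joined).
PROC_DRBD_SAMPLE = """\
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:318254844 nr:160 dw:318255004 dr:9762 al:80148 bm:1 lo:0 pe:1 ua:0 ap:1
"""


def parse_counters(text):
    """Collect the two-letter numeric counters (ns, nr, ..., pe, ua, ap)."""
    return {key: int(val) for key, val in re.findall(r"\b([a-z]{2}):(\d+)\b", text)}


def possibly_stuck(first, second):
    """Heuristic (an assumption, not a DRBD rule): requests are pending and
    application I/O is outstanding in both snapshots, and nothing was sent
    over the network in between."""
    pending = all(s.get("pe", 0) > 0 and s.get("ap", 0) > 0 for s in (first, second))
    return pending and first.get("ns") == second.get("ns")


counters = parse_counters(PROC_DRBD_SAMPLE)
print("pe:", counters["pe"], "ap:", counters["ap"], "ns:", counters["ns"])

# On a live system the two snapshots would be read from /proc/drbd some
# seconds apart; here the same sample is reused just to exercise the check.
print("possibly stuck:", possibly_stuck(counters, counters))
```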
