[DRBD-user] b->n_req != set_size

Sun Mar 9 18:34:53 CET 2008

On Sun, 2008-03-09 at 14:35 +0100, Lars Ellenberg wrote:
> On Sat, Mar 08, 2008 at 02:27:04PM -0800, Tom Brown wrote:
> > Hardware: 2 Dell PowerEdge SC1435's
> > Mirrored Drive: 1TB Hitachi HUA72101 SATA
> > OS: Debian Etch 4.0r3
> > Kernel: vanilla kernel 2.6.24.3
> > DRBD: 8.2.5
> > Heartbeat: 2.1.3
> > 
> > I had drbd and heartbeat up and running. I did initial tests of the
> > fail-over and mirroring. Everything worked as expected. Then I attached
> > an external drive via firewire to a SIIG firewire card in the primary
> > node. I mounted the external drive on /backup. The /dev/drbd0 device is
> > mounted on /ha. Then I issued the following command at 17:50 and left
> > for the night:
> > 
> > tar cf /ha/fullbackup.tar /backup/ha
> > 
> > The /backup/ha directory contains 334GB of data. When I came in to work
> > this morning, I issued an 'ls -lh /ha' command and it hung. I checked
> > syslog and found this:
> > 
> > Mar  7 20:03:02 fs01 kernel: drbd0: FIXME (barrier_acked but pending) f6af0688 W L-coNp-s-- 82821 (621446208s +4096) Connected
> 
> a write request, belonging to epoch 82821,
>  4kB in size, to sector 621446208,
> locally completed ok,
> actually handed over to the tcp stack,
> but still waiting on the "Ack" from the remote node.
> 
> the barrier ack which closes this epoch
> came in before the Ack, which also explains
> why the barrier ack'ed set_size is one less than expected.
> 
> This "should not happen".
> it still does not explain why it hangs after that...
> 
> > Mar  7 20:03:02 fs01 kernel: drbd0: ASSERT( b->n_req == set_size ) in /usr/src/drbd-8.2.5/drbd/drbd_main.c:238
> > Mar  7 20:03:02 fs01 kernel: drbd0: b->n_req = 592 in /usr/src/drbd-8.2.5/drbd/drbd_main.c:246
> > Mar  7 20:03:02 fs01 kernel: drbd0: set_size = 591 in /usr/src/drbd-8.2.5/drbd/drbd_main.c:247
> >
> > Any access to /ha hangs. The tar command is hung. I found a post from
> > January 10, 2005 that the second line in the log is nothing to worry
> > about. I looked in drbd_main.c and didn't see anything that indicated a
> > major problem. It looks like it just reports the sizes of b->n_req and
> > set_size when they are not equal.
> > 
> > What does it mean when b->n_req != set_size? Is this an indicator of why
> > the drbd0 device is not accessible anymore? Or is the first line from
> > the log above (where it says FIXME) an indicator of a bigger problem? 
> > 
> > I had to restart the primary. I could access the drbd0 device after it
> > came back up. I found that the last write to the tar file was at 20:02.
> > That's about the same time those errors showed up in the log.
> > 
> > Well I found some posts about a Broadcom NetXtreme II BCM5708 NIC with
> > TOE causing drbd lockups. I am using an onboard Broadcom NetXtreme
> > BCM5721 NIC without TOE. One of the posts said to try this:
> > 
> > ethtool -K ethX tx off
> > ethtool -K ethX rx off
> > 
> > Which I did and tried it again. This time it worked. Did I fix the
> > problem, or just get lucky? Any ideas?
> 
> There may be some race within DRBD.
> There may just be data corruption (bit flips) on the wire.
> it is also possible that there is some incompatibility between
> DRBD's expectations for some things and how it is actually done
> in kernel 2.6.24.
> 
> Can you give me a "cat /proc/drbd" please?
> (preferably from the time when it "hung".)
> 
> well, you could also just tell me the git-hash of the DRBD "8.2.5"
> and the drbd protocol you have been using, and I'll try to "imagine"
> possible problems while meditating over pieces of source code...

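For anyone else decoding a line like the FIXME message quoted above, here is a minimal Python sketch that splits it into the fields Lars describes (epoch, start sector, size). The field names and the regular expression are my own labels, not anything taken from the DRBD source, and the 512-byte sector size is an assumption. The flags string is kept as-is because I do not know what each character means.

```python
import re

# Hypothetical field names; the layout is inferred from the single log
# line in this thread, not from drbd_main.c.
REQ_RE = re.compile(
    r"\b(?P<addr>[0-9a-f]+) (?P<rw>[RW]) (?P<flags>\S+) (?P<epoch>\d+) "
    r"\((?P<sector>\d+)s \+(?P<size>\d+)\) (?P<cstate>\w+)"
)

SECTOR_BYTES = 512  # assumption: sectors are counted in 512-byte units


def parse_request(text):
    """Return the request fields as a dict, or None if nothing matches."""
    # Collapse whitespace so a line wrapped by the mail client still matches.
    match = REQ_RE.search(" ".join(text.split()))
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("epoch", "sector", "size"):
        fields[key] = int(fields[key])
    fields["byte_offset"] = fields["sector"] * SECTOR_BYTES
    return fields


line = """drbd0: FIXME (barrier_acked but pending)
f6af0688 W L-coNp-s-- 82821 (621446208s +4096) Connected"""
req = parse_request(line)
print(req)
```

The dict makes it easy to check that the stuck request is the 4096-byte write in epoch 82821 that Lars pointed at.
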
Here's the 'cat /proc/drbd' while it was hung:

version: 8.2.5 (api:88/proto:86-88)
GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by tbrown at fs01, 2008-03-06 08:12:19
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:318254844 nr:160 dw:318255004 dr:9762 al:80148 bm:1 lo:0 pe:1 ua:0 ap:1
        resync: used:0/31 hits:5 misses:1 starving:0 dirty:0 changed:1
        act_log: used:1/257 hits:79483563 misses:83224 starving:0 dirty:3076 changed:80148

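In case it helps when comparing snapshots, here is a rough Python sketch that pulls the two-letter counters out of the device lines above and reports which of the "in flight" ones are non-zero. The function names are made up for illustration. As far as I understand the DRBD documentation, pe counts requests sent to the peer that are not yet answered and ap counts application I/O that DRBD has not yet completed, but treat that reading as an assumption.

```python
import re

# Two-letter key followed by a colon and a value, e.g. "pe:1" or "cs:Connected".
PAIR_RE = re.compile(r"\b([a-z]{2}):(\S+)")


def parse_counters(status_text):
    """Collect the key:value pairs of one device into a dict."""
    counters = {}
    for key, value in PAIR_RE.findall(status_text):
        counters[key] = int(value) if value.isdigit() else value
    return counters


def in_flight(counters, keys=("lo", "pe", "ua", "ap")):
    """Return only the outstanding-request counters that are non-zero."""
    return {k: counters[k] for k in keys if counters.get(k)}


status = """ 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:318254844 nr:160 dw:318255004 dr:9762 al:80148 bm:1 lo:0 pe:1
ua:0 ap:1"""

counters = parse_counters(status)
print(counters["cs"], in_flight(counters))
```

On the hung snapshot this leaves pe and ap at 1 each, which would fit the single write still waiting for its Ack.
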
I've been able to write another ~300GB to the drbd0 device without a
problem.

Thanks,
Tom