[DRBD-user] DRBD blocked for more than 120 seconds on CentOS 6.0 (FAIL)

Tue Nov 1 12:22:44 CET 2011

On Tue, 1 Nov 2011 08:00:32 +0100
Igmar Palsenberg <drbd at palsenberg.com> wrote:

> > lock up and backtrace::::::::::::::::::::::::::::::::::::::::
> > 
> > INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > message. drbd_w_r1     D ffff880818639900     0  4599      2
> > 0x00000080 ffff88080dd93c70 0000000000000046 ffff88081a9b8af8
> > ffff8800282569f0 ffff88080dd93bf0 ffffffff81056720 ffff88080dd93c40
> > 000000000000056c ffff88081a9b8638 ffff88080dd93fd8 0000000000010518
> > ffff88081a9b8638 Call Trace:
> > [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
> > [<ffffffff8109218e>] ? prepare_to_wait+0x4e/0x80
> > [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]
> > [<ffffffff81091ea0>] ? autoremove_wake_function+0x0/0x40
> > [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]
> > [<ffffffffa0202056>] ? crc32c+0x56/0x7c [libcrc32c]
> > [<ffffffffa020bdba>] drbd_bm_write_hinted+0x1a/0x20 [drbd]
> > [<ffffffffa0224602>] _al_write_transaction+0x2c2/0x6a0 [drbd]
> > [<ffffffffa0224d42>] w_al_write_transaction+0x22/0x50 [drbd]
> > [<ffffffffa020e85e>] drbd_worker+0x10e/0x480 [drbd]
> > [<ffffffffa022aa19>] drbd_thread_setup+0xa9/0x160 [drbd]
> > [<ffffffff810141ca>] child_rip+0xa/0x20
> > [<ffffffffa022a970>] ? drbd_thread_setup+0x0/0x160 [drbd]
> > [<ffffffff810141c0>] ? child_rip+0x0/0x20
> 
> It's waiting for disk IO. What DRBD proto version are you using ?
Protocol A, but it was administratively disconnected for a few hours well before the time of the crash.

> > Initializing cgroup subsys cpuset
> > Initializing cgroup subsys cpu
> > Linux version 2.6.32-71.29.1.el6.x86_64
> > (mockbuild at c6b5.bsys.dev.centos.org) (gcc version 4.4.4 20100726
> > (Red Hat 4.4.4-13) (GCC) ) #1 SMP Mon Jun 27
> 
> My advice : Test again with a vanilla kernel. RedHat kernels have
> tons of stuff backported, so that version number hardly reflects the
> actual state of the kernel. I've ran into issues with backported
> features before.
Hmm .. sounds like good advice ... the trouble is that the fault was only observed after around a few hours of continuous operation, and on a workload identical to the previous hour of workload.  A previous similar workload ran for 16 hours and produced no failures.  If it was possible to trigger the failure on demand, it would be easier to isolate the cause.

> > d-con r1: peer( Secondary -> Unknown ) conn( Connected ->
> > Disconnecting ) pdsk( UpToDate -> DUnknown ) d-con r1: asender
> > terminated d-con r1: Terminating asender thread
> > block drbd0: new current UUID
> > 36C25BAEFD88C481:49F8A559F608C6F5:09D91D5543CBB23A:09D81D5543CBB23A
> > d-con r1: Connection closed d-con r1: conn( Disconnecting ->
> > StandAlone ) d-con r1: receiver terminated
> > d-con r1: Terminating receiver thread
> 
> The syncer connection got disconnected. That's usually a bad sign.
> Why was the disconnect in the first place ? It might be related to
> the disk IO failing.
The syncer was disconnected by drbdadm disconnect all (or similar).
> 
> > INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > message. drbd_w_r1     D ffff880818639900     0  4599      2
> > 0x00000080 ffff88080dd93c70 0000000000000046 ffff88081a9b8af8
> > ffff8800282569f0 ffff88080dd93bf0 ffffffff81056720 ffff88080dd93c40
> > 000000000000056c ffff88081a9b8638 ffff88080dd93fd8 0000000000010518
> > ffff88081a9b8638 Call Trace:
> > [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
> > [<ffffffff8109218e>] ? prepare_to_wait+0x4e/0x80
> > [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]
> > [<ffffffff81091ea0>] ? autoremove_wake_function+0x0/0x40
> > [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]
> > [<ffffffffa0202056>] ? crc32c+0x56/0x7c [libcrc32c]
> > [<ffffffffa020bdba>] drbd_bm_write_hinted+0x1a/0x20 [drbd]
> > [<ffffffffa0224602>] _al_write_transaction+0x2c2/0x6a0 [drbd]
> > [<ffffffffa0224d42>] w_al_write_transaction+0x22/0x50 [drbd]
> > [<ffffffffa020e85e>] drbd_worker+0x10e/0x480 [drbd]
> > [<ffffffffa022aa19>] drbd_thread_setup+0xa9/0x160 [drbd]
> > [<ffffffff810141ca>] child_rip+0xa/0x20
> > [<ffffffffa022a970>] ? drbd_thread_setup+0x0/0x160 [drbd]
> > [<ffffffff810141c0>] ? child_rip+0x0/0x20
> 
> You're sure your slave can keep up with this machine ? I've seen
> cases where things went bad because the other side's IO subsystems
> where waaaaaaay slower then the masters, so it kept lagging behind,
> and eventually things broke.
The slave hardware is identical to the master, but being disconnected it should not be a factor in this breakage.  (It is disconnected since we require maximum speed for the import operation).  The only thing that should happen here is that the bitmap of changes should be filled up.

&:-)