[DRBD-user] DRBD blocked for more than 120 seconds on CentOS 6.0 (FAIL)

Tue Nov 1 08:00:32 CET 2011

> lock up and backtrace::::::::::::::::::::::::::::::::::::::::
> 
> INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> drbd_w_r1     D ffff880818639900     0  4599      2 0x00000080
> ffff88080dd93c70 0000000000000046 ffff88081a9b8af8 ffff8800282569f0
> ffff88080dd93bf0 ffffffff81056720 ffff88080dd93c40 000000000000056c
> ffff88081a9b8638 ffff88080dd93fd8 0000000000010518 ffff88081a9b8638
> Call Trace:
> [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
> [<ffffffff8109218e>] ? prepare_to_wait+0x4e/0x80
> [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]
> [<ffffffff81091ea0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]
> [<ffffffffa0202056>] ? crc32c+0x56/0x7c [libcrc32c]
> [<ffffffffa020bdba>] drbd_bm_write_hinted+0x1a/0x20 [drbd]
> [<ffffffffa0224602>] _al_write_transaction+0x2c2/0x6a0 [drbd]
> [<ffffffffa0224d42>] w_al_write_transaction+0x22/0x50 [drbd]
> [<ffffffffa020e85e>] drbd_worker+0x10e/0x480 [drbd]
> [<ffffffffa022aa19>] drbd_thread_setup+0xa9/0x160 [drbd]
> [<ffffffff810141ca>] child_rip+0xa/0x20
> [<ffffffffa022a970>] ? drbd_thread_setup+0x0/0x160 [drbd]
> [<ffffffff810141c0>] ? child_rip+0x0/0x20

It's waiting for disk IO. Which DRBD protocol version are you using?
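
To see at a glance where a hung task is stuck, it can help to pull the task name and the reliable stack frames out of the report. The following is only a rough sketch, assuming the log lines look like the ones quoted above; `hung_tasks` and `reliable_frames` are hypothetical helper names, not part of any DRBD tooling. Frames the kernel marks with `?` are skipped, since those are addresses found on the stack that are not guaranteed to be part of the real call chain.

```python
import re

# "INFO: task drbd_w_r1:4599 blocked for more than 120 seconds."
HUNG = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

# "[<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]"
# An optional "? " before the symbol marks an unreliable frame.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def hung_tasks(lines):
    """Return (task_name, pid, seconds) for every hung-task report line."""
    result = []
    for line in lines:
        m = HUNG.search(line)
        if m:
            result.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return result


def reliable_frames(lines):
    """Return (symbol, module) for frames not marked with '?'.

    module is None for symbols that live in the core kernel.
    """
    frames = []
    for line in lines:
        m = FRAME.search(line)
        if m and not m.group(1):
            frames.append((m.group(2), m.group(3)))
    return frames


sample = [
    "> INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.",
    "> [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50",
    "> [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]",
    "> [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]",
    "> [<ffffffff810141ca>] child_rip+0xa/0x20",
]

print(hung_tasks(sample))
print(reliable_frames(sample))
```

In the trace above, the first reliable frame is `bm_page_io_async` in the drbd module, which is consistent with the worker waiting on bitmap IO to complete.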

> Initializing cgroup subsys cpuset
> Initializing cgroup subsys cpu
> Linux version 2.6.32-71.29.1.el6.x86_64 (mockbuild at c6b5.bsys.dev.centos.org) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Mon Jun 27

My advice: test again with a vanilla kernel. RedHat kernels have tons of stuff backported, so the version number hardly reflects the actual state of the kernel. I've run into issues with backported features before.

> d-con r1: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
> d-con r1: asender terminated
> d-con r1: Terminating asender thread
> block drbd0: new current UUID 36C25BAEFD88C481:49F8A559F608C6F5:09D91D5543CBB23A:09D81D5543CBB23A
> d-con r1: Connection closed
> d-con r1: conn( Disconnecting -> StandAlone ) 
> d-con r1: receiver terminated
> d-con r1: Terminating receiver thread

The syncer connection got disconnected. That's usually a bad sign. Why did it disconnect in the first place? It might be related to the disk IO failing.
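
To work out the order in which things went wrong, it may be useful to list the state changes from the log next to their timestamps. This is a minimal sketch, assuming the `field( Old -> New )` format seen in the lines quoted above; `state_transitions` is a hypothetical helper, not a DRBD utility.

```python
import re

# Matches fields such as "conn( Connected -> Disconnecting )".
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def state_transitions(log_lines):
    """Return (field, old_state, new_state) tuples in the order they appear."""
    found = []
    for line in log_lines:
        found.extend(TRANSITION.findall(line))
    return found


sample = [
    "> d-con r1: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )",
    "> d-con r1: asender terminated",
    "> d-con r1: conn( Disconnecting -> StandAlone )",
]

for field, old, new in state_transitions(sample):
    print(f"{field}: {old} -> {new}")
```

Running the same helper over the full kernel log of both nodes, rather than just this excerpt, would show whether the disconnect came before or after the first hung-task report.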

> INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> drbd_w_r1     D ffff880818639900     0  4599      2 0x00000080
> ffff88080dd93c70 0000000000000046 ffff88081a9b8af8 ffff8800282569f0
> ffff88080dd93bf0 ffffffff81056720 ffff88080dd93c40 000000000000056c
> ffff88081a9b8638 ffff88080dd93fd8 0000000000010518 ffff88081a9b8638
> Call Trace:
> [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
> [<ffffffff8109218e>] ? prepare_to_wait+0x4e/0x80
> [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]
> [<ffffffff81091ea0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]
> [<ffffffffa0202056>] ? crc32c+0x56/0x7c [libcrc32c]
> [<ffffffffa020bdba>] drbd_bm_write_hinted+0x1a/0x20 [drbd]
> [<ffffffffa0224602>] _al_write_transaction+0x2c2/0x6a0 [drbd]
> [<ffffffffa0224d42>] w_al_write_transaction+0x22/0x50 [drbd]
> [<ffffffffa020e85e>] drbd_worker+0x10e/0x480 [drbd]
> [<ffffffffa022aa19>] drbd_thread_setup+0xa9/0x160 [drbd]
> [<ffffffff810141ca>] child_rip+0xa/0x20
> [<ffffffffa022a970>] ? drbd_thread_setup+0x0/0x160 [drbd]
> [<ffffffff810141c0>] ? child_rip+0x0/0x20

Are you sure your slave can keep up with this machine? I've seen cases where things went bad because the other side's IO subsystem was way slower than the master's, so it kept lagging behind, and eventually things broke.

	Igmar