[DRBD-user] DRBD blocked for more than 120 seconds on CentOS 6.0 (FAIL)

Tue Nov 1 14:16:54 CET 2011

Hi Igmar,

I'm trying to clear up a couple of misconceptions here, for the benefit
of people who pull this thread from the list archives.

On 2011-11-01 08:00, Igmar Palsenberg wrote:
>> d-con r1: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
>> d-con r1: asender terminated
>> d-con r1: Terminating asender thread
>> block drbd0: new current UUID 36C25BAEFD88C481:49F8A559F608C6F5:09D91D5543CBB23A:09D81D5543CBB23A
>> d-con r1: Connection closed
>> d-con r1: conn( Disconnecting -> StandAlone ) 
>> d-con r1: receiver terminated
>> d-con r1: Terminating receiver thread
> 
> The syncer connection got disconnected. That's usually a bad sign. Why was the disconnect in the first place ? It might be related to the disk IO failing.

Really? How so?

A disconnect applies to the _network_ connection, and to that only. If
you had an issue with local disk I/O, and you had on-io-error set to
detach (the default in 8.4.0+), then you would expect to see a detach
message in the log, not a disconnect.

As Andrew points out, though, this was evidently an administrative
disconnect, so not related to any failure at all.
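
For anyone reading such logs later, here is a rough sketch (not part of
DRBD, just an illustration) of how to tell which side a state-change
line is about. It parses the "field( Old -> New )" transitions seen in
the log quoted above. The function names are made up for this example,
and the second sample line, showing a local disk transition, is
hypothetical rather than taken from this thread.

```python
import re

# Matches one "field( Old -> New )" transition, as in the quoted log lines.
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")

# Fields that describe the connection or the peer, not the local disk.
PEER_FIELDS = {"conn", "peer", "pdsk"}


def transitions(line):
    """Return {field: (old, new)} for every state transition on a log line."""
    return {field: (old, new) for field, old, new in TRANSITION.findall(line)}


def sides(line):
    """Return the set of sides (local disk, network/peer) a line touches."""
    changed = transitions(line)
    result = set()
    if "disk" in changed:
        result.add("local disk")
    if PEER_FIELDS & changed.keys():
        result.add("network/peer")
    return result


# Line taken from the log quoted earlier in this thread.
disconnect_line = ("d-con r1: peer( Secondary -> Unknown ) "
                   "conn( Connected -> Disconnecting ) "
                   "pdsk( UpToDate -> DUnknown )")

# Hypothetical line, for illustration only: a local disk state change.
detach_line = "block drbd0: disk( UpToDate -> Failed )"

print(sides(disconnect_line))
print(sides(detach_line))
```

The quoted line only changes peer, conn and pdsk, so it concerns the
network side; nothing in it says anything about the local disk.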

>> INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> drbd_w_r1     D ffff880818639900     0  4599      2 0x00000080
>> ffff88080dd93c70 0000000000000046 ffff88081a9b8af8 ffff8800282569f0
>> ffff88080dd93bf0 ffffffff81056720 ffff88080dd93c40 000000000000056c
>> ffff88081a9b8638 ffff88080dd93fd8 0000000000010518 ffff88081a9b8638
>> Call Trace:
>> [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
>> [<ffffffff8109218e>] ? prepare_to_wait+0x4e/0x80
>> [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]
>> [<ffffffff81091ea0>] ? autoremove_wake_function+0x0/0x40
>> [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]
>> [<ffffffffa0202056>] ? crc32c+0x56/0x7c [libcrc32c]
>> [<ffffffffa020bdba>] drbd_bm_write_hinted+0x1a/0x20 [drbd]
>> [<ffffffffa0224602>] _al_write_transaction+0x2c2/0x6a0 [drbd]
>> [<ffffffffa0224d42>] w_al_write_transaction+0x22/0x50 [drbd]
>> [<ffffffffa020e85e>] drbd_worker+0x10e/0x480 [drbd]
>> [<ffffffffa022aa19>] drbd_thread_setup+0xa9/0x160 [drbd]
>> [<ffffffff810141ca>] child_rip+0xa/0x20
>> [<ffffffffa022a970>] ? drbd_thread_setup+0x0/0x160 [drbd]
>> [<ffffffff810141c0>] ? child_rip+0x0/0x20
> 
> You're sure your slave can keep up with this machine ? I've seen cases where things went bad because the other side's IO subsystems where waaaaaaay slower then the masters, so it kept lagging behind, and eventually things broke.

There is no such thing as "lagging behind" in this configuration.
Suspended replication is not configured (unless there's a configuration
snippet that specifies it, and that wasn't posted). If the peer is much
slower and can't keep up, then I/O is eventually just going to block.
That would, in fact, explain a hung task, but it has little to do with
one node "falling behind" the other.

And since we're talking about an administratively disconnected DRBD
here, as Andrew has been saying, nothing peer-related should be a factor
here at all.

Cheers,
Florian

-- 
Need help with DRBD?
http://www.hastexo.com/knowledge/drbd