On Tue, Nov 01, 2011 at 02:16:54PM +0100, Florian Haas wrote:
> Hi Igmar,
>
> trying to rectify a couple of misconceptions here, for people pulling
> this thread from the list archives.
>
> On 2011-11-01 08:00, Igmar Palsenberg wrote:
> >> d-con r1: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> >> d-con r1: asender terminated
> >> d-con r1: Terminating asender thread
> >> block drbd0: new current UUID 36C25BAEFD88C481:49F8A559F608C6F5:09D91D5543CBB23A:09D81D5543CBB23A
> >> d-con r1: Connection closed
> >> d-con r1: conn( Disconnecting -> StandAlone )
> >> d-con r1: receiver terminated
> >> d-con r1: Terminating receiver thread
> >
> > The syncer connection got disconnected. That's usually a bad sign. Why
> > was there a disconnect in the first place? It might be related to the
> > disk I/O failing.
>
> Really? How so?
>
> Disconnect applies to the _network_ connection, and to that only. If you
> had an issue with local disk I/O, and you had on-io-error set to detach
> (default in 8.4.0+), then a detach message would be what's expected. Not
> a disconnect.
>
> As Andrew points out, though, this was evidently an administrative
> disconnect, so not related to any failure at all.
>
> >> INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> drbd_w_r1     D ffff880818639900     0  4599      2 0x00000080
> >> ffff88080dd93c70 0000000000000046 ffff88081a9b8af8 ffff8800282569f0
> >> ffff88080dd93bf0 ffffffff81056720 ffff88080dd93c40 000000000000056c
> >> ffff88081a9b8638 ffff88080dd93fd8 0000000000010518 ffff88081a9b8638
> >> Call Trace:
> >> [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
> >> [<ffffffff8109218e>] ? prepare_to_wait+0x4e/0x80
> >> [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]
> >> [<ffffffff81091ea0>] ? autoremove_wake_function+0x0/0x40
> >> [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]
> >> [<ffffffffa0202056>] ? crc32c+0x56/0x7c [libcrc32c]
> >> [<ffffffffa020bdba>] drbd_bm_write_hinted+0x1a/0x20 [drbd]
> >> [<ffffffffa0224602>] _al_write_transaction+0x2c2/0x6a0 [drbd]
> >> [<ffffffffa0224d42>] w_al_write_transaction+0x22/0x50 [drbd]
> >> [<ffffffffa020e85e>] drbd_worker+0x10e/0x480 [drbd]
> >> [<ffffffffa022aa19>] drbd_thread_setup+0xa9/0x160 [drbd]
> >> [<ffffffff810141ca>] child_rip+0xa/0x20
> >> [<ffffffffa022a970>] ? drbd_thread_setup+0x0/0x160 [drbd]
> >> [<ffffffff810141c0>] ? child_rip+0x0/0x20
> >
> > You're sure your slave can keep up with this machine? I've seen cases
> > where things went bad because the other side's I/O subsystem was way
> > slower than the master's, so it kept lagging behind, and eventually
> > things broke.
>
> There is no such thing as "lagging behind" in this configuration.
> Suspended replication is not configured (unless there's a configuration
> snippet that specifies this, and it wasn't posted). If the peer is way
> slower, then I/O is eventually just going to block if the peer can't
> keep up. Which would, in fact, explain a hung task, but that has little
> to do with something "falling behind" something else.
>
> And since we're talking about an administratively disconnected DRBD
> here, as Andrew has been saying, anything peer related shouldn't factor
> in here at all.

As I pointed out earlier:

    The traces you provide suggest that DRBD is waiting for completion
    of IO (meta data transactions, in this case) to the local disk,
    which for some reason does not happen.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
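For readers pulling this from the archives: the on-io-error behavior
Florian refers to is a per-resource disk option. A minimal drbd.conf
fragment as a sketch (the resource name matches the logs above, but the
fragment itself is illustrative, not the poster's actual configuration):

```
resource r1 {
  disk {
    # On a local disk I/O error, detach from the backing device and
    # continue serving reads/writes through the peer ("diskless" mode).
    # This is the default in DRBD 8.4.0 and later.
    on-io-error detach;
  }
}
```

With this policy a local disk failure produces a detach message in the
kernel log, which is exactly why Florian says a disconnect message cannot
be explained by failing disk I/O.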
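The disconnect-vs-detach distinction is visible directly in the kernel
log lines quoted above: connection state changes are reported in a
"conn( ... )" field, while local and peer disk states appear as
"disk( ... )" and "pdsk( ... )". A rough shell sketch of that reading of
the logs (the classify_drbd_event helper is hypothetical, written for
this post; it is not a DRBD tool):

```shell
#!/bin/sh
# Hypothetical helper (not part of DRBD): crudely classify a DRBD kernel
# log line by which state field it mentions first. "conn(" is the
# network connection state; "disk(" / "pdsk(" are the local and peer
# disk states.
classify_drbd_event() {
    case "$1" in
        *"conn("*)           echo network ;;  # connect/disconnect events
        *"disk("*|*"pdsk("*) echo disk ;;     # attach/detach, disk failure
        *)                   echo other ;;
    esac
}

# The StandAlone transition from the log above is a network-level event:
classify_drbd_event "d-con r1: conn( Disconnecting -> StandAlone )"  # prints "network"
```

A detach caused by on-io-error would instead log a disk state transition,
which this crude filter would report as "disk".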