On Tue, Nov 01, 2011 at 02:16:54PM +0100, Florian Haas wrote:
> Hi Igmar,
>
> trying to rectify a couple of misconceptions here, for people pulling
> this thread from the list archives.
>
> On 2011-11-01 08:00, Igmar Palsenberg wrote:
> >> d-con r1: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> >> d-con r1: asender terminated
> >> d-con r1: Terminating asender thread
> >> block drbd0: new current UUID 36C25BAEFD88C481:49F8A559F608C6F5:09D91D5543CBB23A:09D81D5543CBB23A
> >> d-con r1: Connection closed
> >> d-con r1: conn( Disconnecting -> StandAlone )
> >> d-con r1: receiver terminated
> >> d-con r1: Terminating receiver thread
> >
> > The syncer connection got disconnected. That's usually a bad sign. Why
> > was there a disconnect in the first place? It might be related to the
> > disk I/O failing.
>
> Really? How so?
>
> Disconnect applies to the _network_ connection, and to that only. If you
> had an issue with local disk I/O, and you had on-io-error set to detach
> (default in 8.4.0+), then a detach message would be what's expected. Not
> a disconnect.
>
> As Andrew points out, though, this was evidently an administrative
> disconnect, so not related to any failure at all.
>
> >> INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> drbd_w_r1     D ffff880818639900     0  4599      2 0x00000080
> >> ffff88080dd93c70 0000000000000046 ffff88081a9b8af8 ffff8800282569f0
> >> ffff88080dd93bf0 ffffffff81056720 ffff88080dd93c40 000000000000056c
> >> ffff88081a9b8638 ffff88080dd93fd8 0000000000010518 ffff88081a9b8638
> >> Call Trace:
> >> [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
> >> [<ffffffff8109218e>] ? prepare_to_wait+0x4e/0x80
> >> [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]
> >> [<ffffffff81091ea0>] ? autoremove_wake_function+0x0/0x40
> >> [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]
> >> [<ffffffffa0202056>] ? crc32c+0x56/0x7c [libcrc32c]
> >> [<ffffffffa020bdba>] drbd_bm_write_hinted+0x1a/0x20 [drbd]
> >> [<ffffffffa0224602>] _al_write_transaction+0x2c2/0x6a0 [drbd]
> >> [<ffffffffa0224d42>] w_al_write_transaction+0x22/0x50 [drbd]
> >> [<ffffffffa020e85e>] drbd_worker+0x10e/0x480 [drbd]
> >> [<ffffffffa022aa19>] drbd_thread_setup+0xa9/0x160 [drbd]
> >> [<ffffffff810141ca>] child_rip+0xa/0x20
> >> [<ffffffffa022a970>] ? drbd_thread_setup+0x0/0x160 [drbd]
> >> [<ffffffff810141c0>] ? child_rip+0x0/0x20
> >
> > You're sure your slave can keep up with this machine? I've seen cases
> > where things went bad because the other side's I/O subsystem was way
> > slower than the master's, so it kept lagging behind, and eventually
> > things broke.
>
> There is no such thing as "lagging behind" in this configuration.
> Suspended replication is not configured (unless there's a configuration
> snippet that specifies this, and it wasn't posted). If the peer is way
> slower, then I/O is eventually just going to block if the peer can't
> keep up. Which would, in fact, explain a hung task, but that has little
> to do with something "falling behind" something else.
>
> And since we're talking about an administratively disconnected DRBD
> here, as Andrew has been saying, anything peer related shouldn't factor
> in here at all.

As I pointed out earlier:

    The traces you provide suggest that DRBD is waiting for completion
    of IO (meta data transactions, in this case) to the local disk,
    which for some reason does not happen.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
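For readers pulling this from the archives: the on-io-error behavior
Florian refers to is a per-resource disk option. A minimal drbd.conf
fragment as a sketch (the resource name matches the logs above, but the
fragment itself is illustrative, not the poster's actual configuration):

```
resource r1 {
  disk {
    # On a local disk I/O error, detach from the backing device and
    # continue serving reads/writes through the peer ("diskless" mode).
    # This is the default in DRBD 8.4.0 and later.
    on-io-error detach;
  }
}
```

With this policy a local disk failure produces a detach message in the
kernel log, which is exactly why Florian says a disconnect message cannot
be explained by failing disk I/O.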
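The disconnect-vs-detach distinction is visible directly in the kernel
log lines quoted above: connection state changes are reported in a
"conn( ... )" field, while local and peer disk states appear as
"disk( ... )" and "pdsk( ... )". A rough shell sketch of that reading of
the logs (the classify_drbd_event helper is hypothetical, written for
this post; it is not a DRBD tool):

```shell
#!/bin/sh
# Hypothetical helper (not part of DRBD): crudely classify a DRBD kernel
# log line by which state field it mentions first. "conn(" is the
# network connection state; "disk(" / "pdsk(" are the local and peer
# disk states.
classify_drbd_event() {
    case "$1" in
        *"conn("*)           echo network ;;  # connect/disconnect events
        *"disk("*|*"pdsk("*) echo disk ;;     # attach/detach, disk failure
        *)                   echo other ;;
    esac
}

# The StandAlone transition from the log above is a network-level event:
classify_drbd_event "d-con r1: conn( Disconnecting -> StandAlone )"  # prints "network"
```

A detach caused by on-io-error would instead log a disk state transition,
which this crude filter would report as "disk".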