Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
> lock up and backtrace::::::::::::::::::::::::::::::::::::::::
>
> INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> drbd_w_r1 D ffff880818639900 0 4599 2 0x00000080
> ffff88080dd93c70 0000000000000046 ffff88081a9b8af8 ffff8800282569f0
> ffff88080dd93bf0 ffffffff81056720 ffff88080dd93c40 000000000000056c
> ffff88081a9b8638 ffff88080dd93fd8 0000000000010518 ffff88081a9b8638
> Call Trace:
> [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
> [<ffffffff8109218e>] ? prepare_to_wait+0x4e/0x80
> [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]
> [<ffffffff81091ea0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]
> [<ffffffffa0202056>] ? crc32c+0x56/0x7c [libcrc32c]
> [<ffffffffa020bdba>] drbd_bm_write_hinted+0x1a/0x20 [drbd]
> [<ffffffffa0224602>] _al_write_transaction+0x2c2/0x6a0 [drbd]
> [<ffffffffa0224d42>] w_al_write_transaction+0x22/0x50 [drbd]
> [<ffffffffa020e85e>] drbd_worker+0x10e/0x480 [drbd]
> [<ffffffffa022aa19>] drbd_thread_setup+0xa9/0x160 [drbd]
> [<ffffffff810141ca>] child_rip+0xa/0x20
> [<ffffffffa022a970>] ? drbd_thread_setup+0x0/0x160 [drbd]
> [<ffffffff810141c0>] ? child_rip+0x0/0x20

It's waiting for disk IO. What DRBD protocol version are you using?

> Initializing cgroup subsys cpuset
> Initializing cgroup subsys cpu
> Linux version 2.6.32-71.29.1.el6.x86_64 (mockbuild at c6b5.bsys.dev.centos.org) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Mon Jun 27

My advice: test again with a vanilla kernel. Red Hat kernels have tons of
stuff backported, so the version number hardly reflects the actual state
of the kernel. I've run into issues with backported features before.
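
To illustrate the point about version numbers, here is a minimal Python
sketch that guesses from a release string (as printed by "uname -r")
whether a kernel is a plain kernel.org build or a distribution build. The
function name and the pattern are assumptions made for this example; it is
only a heuristic, since locally built kernels can carry arbitrary suffixes.

```python
import re

# Heuristic (an assumption for this sketch): a vanilla release string is
# just dotted numbers, optionally followed by an -rcN tag. Anything with
# extra suffixes such as "-71.29.1.el6.x86_64" is treated as a distro build.
_VANILLA_RE = re.compile(r"^\d+\.\d+(\.\d+){0,2}(-rc\d+)?$")


def looks_vanilla(release: str) -> bool:
    """Return True if the release string looks like a kernel.org version."""
    return _VANILLA_RE.match(release) is not None


if __name__ == "__main__":
    for rel in ("2.6.32-71.29.1.el6.x86_64", "2.6.32.43"):
        kind = "vanilla-looking" if looks_vanilla(rel) else "distro/custom"
        print(f"{rel}: {kind}")
```

A "distro/custom" result only says the number cannot be taken at face
value; it says nothing about which patches were actually backported.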
> d-con r1: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> d-con r1: asender terminated
> d-con r1: Terminating asender thread
> block drbd0: new current UUID 36C25BAEFD88C481:49F8A559F608C6F5:09D91D5543CBB23A:09D81D5543CBB23A
> d-con r1: Connection closed
> d-con r1: conn( Disconnecting -> StandAlone )
> d-con r1: receiver terminated
> d-con r1: Terminating receiver thread

The syncer connection got disconnected. That's usually a bad sign. Why did
it disconnect in the first place? It might be related to the disk IO
failing.

> INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> drbd_w_r1 D ffff880818639900 0 4599 2 0x00000080
> ffff88080dd93c70 0000000000000046 ffff88081a9b8af8 ffff8800282569f0
> ffff88080dd93bf0 ffffffff81056720 ffff88080dd93c40 000000000000056c
> ffff88081a9b8638 ffff88080dd93fd8 0000000000010518 ffff88081a9b8638
> Call Trace:
> [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
> [<ffffffff8109218e>] ? prepare_to_wait+0x4e/0x80
> [<ffffffffa02098e5>] bm_page_io_async+0xe5/0x370 [drbd]
> [<ffffffff81091ea0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffffa020b8c2>] bm_rw+0x1a2/0x680 [drbd]
> [<ffffffffa0202056>] ? crc32c+0x56/0x7c [libcrc32c]
> [<ffffffffa020bdba>] drbd_bm_write_hinted+0x1a/0x20 [drbd]
> [<ffffffffa0224602>] _al_write_transaction+0x2c2/0x6a0 [drbd]
> [<ffffffffa0224d42>] w_al_write_transaction+0x22/0x50 [drbd]
> [<ffffffffa020e85e>] drbd_worker+0x10e/0x480 [drbd]
> [<ffffffffa022aa19>] drbd_thread_setup+0xa9/0x160 [drbd]
> [<ffffffff810141ca>] child_rip+0xa/0x20
> [<ffffffffa022a970>] ? drbd_thread_setup+0x0/0x160 [drbd]
> [<ffffffff810141c0>] ? child_rip+0x0/0x20

Are you sure your slave can keep up with this machine? I've seen cases
where things went bad because the other side's IO subsystem was way slower
than the master's, so it kept lagging behind, and eventually things broke.

Igmar
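
When a log contains many of these reports, it can help to pull out just
the "blocked for more than N seconds" lines to see which tasks hang and
how often. The following is a minimal Python sketch for that; the function
name and the regular expression are assumptions for this example, written
against the message format quoted above, and it assumes task names without
whitespace.

```python
import re

# Matches lines such as:
#   INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.
# The pattern is an assumption based on the messages quoted in this thread.
_HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(log_text: str):
    """Return a list of (task name, pid, seconds) for each hung-task report."""
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("secs")))
        for m in _HUNG_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "INFO: task drbd_w_r1:4599 blocked for more than 120 seconds.\n"
        '"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.\n'
    )
    for name, pid, secs in hung_tasks(sample):
        print(f"{name} (pid {pid}) blocked for more than {secs}s")
```

Feeding it the output of dmesg or a saved kernel log gives a quick list of
affected tasks; the call traces themselves still need to be read by hand.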