Hi

For a number of years on both DRBD 8.3 and 8.4 I have been seeing a weird bug that I've never reported because it's too difficult to recreate reliably. Today I got something similar and have a stacktrace so thought I'd send it in in the hope that it might point in the right direction to getting this fixed.

The situation goes like this:

1. Set up a DRBD device and get it replicating
2. Leave said device in Secondary/Secondary for days/weeks/months (not sure how long it takes to happen)
3. At some future point, run `drbdadm down $resource` on one of the two systems.

This now hangs doing nothing forever or until you switch to the other system and run `drbdadm down $resource` there as well at which point both go down and all is ok.

In an attempt to stop this happening on one pair of systems where I have a permanently inactive (i.e secondary/secondary) resource waiting for someone to buy me a Windows license so I can install Windows on a VM using that DRBD resource, I left it in primary/secondary even though nothing is using it.

Today I went to shutdown -r the primary system and it got as far as stopping DRBD on the system with this resource in primary and it hung and I got the following stacktrace in the logs:

Apr 29 12:57:19 xen23 kernel: block drbd8: role( Primary -> Secondary )
Apr 29 12:57:19 xen23 kernel: block drbd8: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Apr 29 13:01:05 xen23 kernel: INFO: task drbdsetup-84:81150 blocked for more than 120 seconds.
Apr 29 13:01:05 xen23 kernel: Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Apr 29 13:01:05 xen23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 13:01:05 xen23 kernel: drbdsetup-84 D 0000000000000003 0 81150 81128 0x00000080
Apr 29 13:01:05 xen23 kernel: ffff880c6195b8a8 0000000000000082 ffff880c6195b828 ffffffffa06e6c18
Apr 29 13:01:05 xen23 kernel: ffff880c6042f8d0 00000000000110aa ffff880c67f6e000 0000000000000000
Apr 29 13:01:05 xen23 kernel: 0000000000000001 00000000000110aa ffff880c6723fad8 ffff880c6195bfd8
Apr 29 13:01:05 xen23 kernel: Call Trace:
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e6c18>] ? is_valid_state+0x98/0x500 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e89a7>] _conn_request_state+0x8e7/0xb00 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06d1942>] ? drbd_send_command+0x42/0x50 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e8c0d>] conn_request_state+0x4d/0x80 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06d7f12>] conn_try_disconnect+0x22/0x120 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e3366>] drbd_adm_down+0x106/0x240 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a8b91>] genl_rcv_msg+0x271/0x340
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a8920>] ? genl_rcv_msg+0x0/0x340
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a7789>] netlink_rcv_skb+0xa9/0xd0
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a86dc>] genl_rcv+0x2c/0x40
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a73af>] netlink_unicast+0x2df/0x330
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a7e13>] netlink_sendmsg+0x2a3/0x3e0
Apr 29 13:01:05 xen23 kernel: [<ffffffff81466c7b>] sock_aio_write+0x19b/0x1c0
Apr 29 13:01:05 xen23 kernel: [<ffffffff8119996a>] do_sync_write+0xfa/0x140
Apr 29 13:01:05 xen23 kernel: [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
Apr 29 13:01:05 xen23 kernel: [<ffffffff81247d9f>] ? selinux_file_permission+0xbf/0x150
Apr 29 13:01:05 xen23 kernel: [<ffffffff81014b19>] ? read_tsc+0x9/0x10
Apr 29 13:01:05 xen23 kernel: [<ffffffff8123aa66>] ? security_file_permission+0x16/0x20
Apr 29 13:01:05 xen23 kernel: [<ffffffff81199d34>] vfs_write+0x184/0x1a0
Apr 29 13:01:05 xen23 kernel: [<ffffffff8119b156>] ? fget_light_pos+0x16/0x50
Apr 29 13:01:05 xen23 kernel: [<ffffffff8119a7a1>] sys_write+0x51/0xb0
Apr 29 13:01:05 xen23 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

As usual, running drbdadm down $resource on the other side promptly brought this one back to life and the shutdown continued and it rebooted OK afterwards.

This was running a slightly out of date CentOS 6 kernel and ELRepo's kmod-drbd84-8.4.9-1.el6.elrepo.x86_64 which should be plain DRBD 8.4.9. It was being shut down to bring it back up on 6.9 with the latest kernel.

Oh, and despite the system's name being xen23, it's not running xen. It's a plain CentOS 6.8 system running kvm and several VMs. Hasn't been xen in a long time.

Not sure if this might help to diagnose this slightly weird issue.

Trevor