I had a strange problem yesterday. The I/O on the primary node blocked after the secondary node had a storage problem and became diskless. There was no storage problem on the primary node (at least from what I can see in /var/log/messages).

The processes writing to the disk became stuck at 100% iowait, and a reboot several hours later hung because the DRBD device was held open by the stuck processes.

Has anyone seen this behaviour before? Any idea what can be done to avoid such problems?

OS: RHEL 6
Kernel: 2.6.32-431.23.3.el6.x86_64
DRBD version: 8.4.5

/var/log/messages on primary node:

Aug 6 09:35:34 NODE01 kernel: block drbd0: Remote failed to finish a request within ko-count * timeout
Aug 6 09:35:34 NODE01 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Aug 6 09:35:34 NODE01 kernel: block drbd0: new current UUID 9523C48040E0780D:1A58D0763AAC64A9:86E5804984394CE1:86E4804984394CE1
Aug 6 09:35:34 NODE01 kernel: drbd repdata: asender terminated
Aug 6 09:35:34 NODE01 kernel: drbd repdata: Terminating drbd_a_repdata
Aug 6 09:35:34 NODE01 kernel: drbd repdata: Connection closed
Aug 6 09:35:34 NODE01 kernel: block drbd0: conn( Timeout -> Unconnected )
Aug 6 09:35:34 NODE01 kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown )
Aug 6 09:35:34 NODE01 kernel: drbd repdata: receiver terminated
Aug 6 09:35:34 NODE01 kernel: drbd repdata: Restarting receiver thread
Aug 6 09:35:34 NODE01 kernel: drbd repdata: receiver (re)started
Aug 6 09:35:34 NODE01 kernel: drbd repdata: conn( Unconnected -> WFConnection )
Aug 6 09:35:34 NODE01 kernel: block drbd1: new current UUID 51EBC6BE2F2729CD:9591EC68BC51A519:E4656A33D9A47115:E4646A33D9A47115
Aug 6 09:37:41 NODE01 kernel: drbd repdata: Handshake successful: Agreed network protocol version 101
Aug 6 09:37:41 NODE01 kernel: drbd repdata: Agreed to support TRIM on protocol level
Aug 6 09:37:41 NODE01 kernel: drbd repdata: Peer authenticated using 20 bytes HMAC
Aug 6 09:37:41 NODE01 kernel: drbd repdata: conn( WFConnection -> WFReportParams )
Aug 6 09:37:41 NODE01 kernel: drbd repdata: Starting asender thread (from drbd_r_repdata [2707])
Aug 6 09:37:41 NODE01 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )
Aug 6 09:39:23 NODE01 kernel: INFO: task jbd2/drbd1-8:9509 blocked for more than 120 seconds.
Aug 6 09:39:23 NODE01 kernel: Not tainted 2.6.32-431.23.3.el6.x86_64 #1
Aug 6 09:39:23 NODE01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 6 09:39:23 NODE01 kernel: jbd2/drbd1-8 D 0000000000000001 0 9509 2 0x00000080
Aug 6 09:39:23 NODE01 kernel: ffff88086e7e5c20 0000000000000046 0000000000000000 ffff88086e7e5be4
Aug 6 09:39:23 NODE01 kernel: 0000000000000000 ffff88087fc24400 ffff880028256840 0000000000000400
Aug 6 09:39:23 NODE01 kernel: ffff88086d1f85f8 ffff88086e7e5fd8 000000000000fbc8 ffff88086d1f85f8
Aug 6 09:39:23 NODE01 kernel: Call Trace:
Aug 6 09:39:23 NODE01 kernel: [<ffffffff811bfae0>] ? sync_buffer+0x0/0x50
Aug 6 09:39:23 NODE01 kernel: [<ffffffff81529393>] io_schedule+0x73/0xc0
Aug 6 09:39:23 NODE01 kernel: [<ffffffff811bfb20>] sync_buffer+0x40/0x50
Aug 6 09:39:23 NODE01 kernel: [<ffffffff81529e5f>] __wait_on_bit+0x5f/0x90
Aug 6 09:39:23 NODE01 kernel: [<ffffffff811bfae0>] ? sync_buffer+0x0/0x50
Aug 6 09:39:23 NODE01 kernel: [<ffffffff81529f08>] out_of_line_wait_on_bit+0x78/0x90
Aug 6 09:39:23 NODE01 kernel: [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50
Aug 6 09:39:23 NODE01 kernel: [<ffffffff811bfad6>] __wait_on_buffer+0x26/0x30
Aug 6 09:39:23 NODE01 kernel: [<ffffffffa014e7f1>] jbd2_journal_commit_transaction+0x1181/0x1500 [jbd2]
Aug 6 09:39:23 NODE01 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Aug 6 09:39:23 NODE01 kernel: [<ffffffff81084a1b>] ? try_to_del_timer_sync+0x7b/0xe0
Aug 6 09:39:23 NODE01 kernel: [<ffffffffa0153a48>] kjournald2+0xb8/0x220 [jbd2]
Aug 6 09:39:23 NODE01 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
Aug 6 09:39:23 NODE01 kernel: [<ffffffffa0153990>] ? kjournald2+0x0/0x220 [jbd2]
Aug 6 09:39:23 NODE01 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
Aug 6 09:39:23 NODE01 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Aug 6 09:39:23 NODE01 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Aug 6 09:39:23 NODE01 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Aug 6 09:39:23 NODE01 kernel: INFO: task oracle:9573 blocked for more than 120 seconds.
Aug 6 09:39:23 NODE01 kernel: Not tainted 2.6.32-431.23.3.el6.x86_64 #1

Radu
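
For anyone comparing a similar log, here is a minimal Python sketch that pulls the state transitions (peer, conn, pdsk) out of lines like the ones above, which may make it easier to see when the peer went from UpToDate to Diskless. The function name and both regular expressions are assumptions for illustration, based only on the line format visible in this log; this is not part of any DRBD tool.

```python
import re

# Assumed pattern for state-change fragments such as "conn( Connected -> Timeout )".
TRANSITION = re.compile(r"(peer|conn|pdsk|disk|role)\( (\w+) -> (\w+) \)")
# Assumed pattern for the message source, e.g. "block drbd0:" or "drbd repdata:".
SOURCE = re.compile(r"kernel: (block drbd\d+|drbd \S+):")


def parse_transitions(line):
    """Return (source, field, old_state, new_state) tuples found in one log line."""
    src = SOURCE.search(line)
    if src is None:
        return []  # not a DRBD kernel message in the assumed format
    return [
        (src.group(1), field, old, new)
        for field, old, new in TRANSITION.findall(line)
    ]


if __name__ == "__main__":
    sample = (
        "Aug 6 09:37:41 NODE01 kernel: block drbd0: peer( Unknown -> Secondary ) "
        "conn( WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )"
    )
    for entry in parse_transitions(sample):
        print(entry)
```

Run over the whole of /var/log/messages line by line, this would list each transition in order per device or resource. Lines without a state change, such as the call trace, yield an empty list.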