[DRBD-user] Primary node blocked after secondary node becomes diskless

Radu Radutiu rradutiu at gmail.com
Fri Aug 7 09:37:12 CEST 2015


I had a strange problem yesterday. I/O on the primary node blocked
after the secondary node had a storage problem and became diskless. There
was no storage problem on the primary node (at least none that I can see
in /var/log/messages). The processes writing to the disk became stuck at
100% iowait, and a reboot attempted several hours later hung because the
DRBD device was still held open by the stuck processes.
Has anyone seen this behaviour before? Any idea what can be done to avoid
such problems?

OS: RHEL 6, kernel 2.6.32-431.23.3.el6.x86_64
DRBD version 8.4.5
/var/log/messages on primary node:

Aug  6 09:35:34 NODE01 kernel: block drbd0: Remote failed to finish a request within ko-count * timeout
Aug  6 09:35:34 NODE01 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Aug  6 09:35:34 NODE01 kernel: block drbd0: new current UUID 9523C48040E0780D:1A58D0763AAC64A9:86E5804984394CE1:86E4804984394CE1
Aug  6 09:35:34 NODE01 kernel: drbd repdata: asender terminated
Aug  6 09:35:34 NODE01 kernel: drbd repdata: Terminating drbd_a_repdata
Aug  6 09:35:34 NODE01 kernel: drbd repdata: Connection closed
Aug  6 09:35:34 NODE01 kernel: block drbd0: conn( Timeout -> Unconnected )
Aug  6 09:35:34 NODE01 kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown )
Aug  6 09:35:34 NODE01 kernel: drbd repdata: receiver terminated
Aug  6 09:35:34 NODE01 kernel: drbd repdata: Restarting receiver thread
Aug  6 09:35:34 NODE01 kernel: drbd repdata: receiver (re)started
Aug  6 09:35:34 NODE01 kernel: drbd repdata: conn( Unconnected -> WFConnection )
Aug  6 09:35:34 NODE01 kernel: block drbd1: new current UUID 51EBC6BE2F2729CD:9591EC68BC51A519:E4656A33D9A47115:E4646A33D9A47115
Aug  6 09:37:41 NODE01 kernel: drbd repdata: Handshake successful: Agreed network protocol version 101
Aug  6 09:37:41 NODE01 kernel: drbd repdata: Agreed to support TRIM on protocol level
Aug  6 09:37:41 NODE01 kernel: drbd repdata: Peer authenticated using 20 bytes HMAC
Aug  6 09:37:41 NODE01 kernel: drbd repdata: conn( WFConnection -> WFReportParams )
Aug  6 09:37:41 NODE01 kernel: drbd repdata: Starting asender thread (from drbd_r_repdata [2707])
Aug  6 09:37:41 NODE01 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )
Aug  6 09:39:23 NODE01 kernel: INFO: task jbd2/drbd1-8:9509 blocked for more than 120 seconds.
Aug  6 09:39:23 NODE01 kernel:      Not tainted 2.6.32-431.23.3.el6.x86_64 #1
Aug  6 09:39:23 NODE01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 09:39:23 NODE01 kernel: jbd2/drbd1-8  D 0000000000000001     0 9509      2 0x00000080
Aug  6 09:39:23 NODE01 kernel: ffff88086e7e5c20 0000000000000046 0000000000000000 ffff88086e7e5be4
Aug  6 09:39:23 NODE01 kernel: 0000000000000000 ffff88087fc24400 ffff880028256840 0000000000000400
Aug  6 09:39:23 NODE01 kernel: ffff88086d1f85f8 ffff88086e7e5fd8 000000000000fbc8 ffff88086d1f85f8
Aug  6 09:39:23 NODE01 kernel: Call Trace:
Aug  6 09:39:23 NODE01 kernel: [<ffffffff811bfae0>] ? sync_buffer+0x0/0x50
Aug  6 09:39:23 NODE01 kernel: [<ffffffff81529393>] io_schedule+0x73/0xc0
Aug  6 09:39:23 NODE01 kernel: [<ffffffff811bfb20>] sync_buffer+0x40/0x50
Aug  6 09:39:23 NODE01 kernel: [<ffffffff81529e5f>] __wait_on_bit+0x5f/0x90
Aug  6 09:39:23 NODE01 kernel: [<ffffffff811bfae0>] ? sync_buffer+0x0/0x50
Aug  6 09:39:23 NODE01 kernel: [<ffffffff81529f08>] out_of_line_wait_on_bit+0x78/0x90
Aug  6 09:39:23 NODE01 kernel: [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50
Aug  6 09:39:23 NODE01 kernel: [<ffffffff811bfad6>] __wait_on_buffer+0x26/0x30
Aug  6 09:39:23 NODE01 kernel: [<ffffffffa014e7f1>] jbd2_journal_commit_transaction+0x1181/0x1500 [jbd2]
Aug  6 09:39:23 NODE01 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Aug  6 09:39:23 NODE01 kernel: [<ffffffff81084a1b>] ? try_to_del_timer_sync+0x7b/0xe0
Aug  6 09:39:23 NODE01 kernel: [<ffffffffa0153a48>] kjournald2+0xb8/0x220 [jbd2]
Aug  6 09:39:23 NODE01 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
Aug  6 09:39:23 NODE01 kernel: [<ffffffffa0153990>] ? kjournald2+0x0/0x220 [jbd2]
Aug  6 09:39:23 NODE01 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
Aug  6 09:39:23 NODE01 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Aug  6 09:39:23 NODE01 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Aug  6 09:39:23 NODE01 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Aug  6 09:39:23 NODE01 kernel: INFO: task oracle:9573 blocked for more than 120 seconds.
Aug  6 09:39:23 NODE01 kernel:      Not tainted 2.6.32-431.23.3.el6.x86_64 #1
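
To follow the sequence of events in a log like this, it can help to pull out only the DRBD state changes (the peer/conn/pdsk fields). Below is a minimal Python sketch, not a DRBD tool: the function name drbd_transitions and the two regular expressions are my own assumptions, and it expects each syslog entry on a single line, as in /var/log/messages itself. The two sample lines are taken from the log above.

```python
import re

# Matches DRBD state-change fragments such as "conn( Connected -> Timeout )".
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")
# Matches the syslog prefix and the DRBD object name
# ("block drbd0" for a device, "drbd repdata" for a resource).
PREFIX = re.compile(r"^(\w{3}\s+\d+ [\d:]{8}) \S+ kernel: (block \w+|drbd \w+): ")


def drbd_transitions(lines):
    """Yield (timestamp, device, field, old, new) for every state change."""
    for line in lines:
        m = PREFIX.match(line)
        if not m:
            continue  # not a DRBD kernel message (hypothetical filter)
        timestamp, device = m.groups()
        for field, old, new in TRANSITION.findall(line):
            yield timestamp, device, field, old, new


# Sample entries copied from the log above, one entry per line.
SAMPLE = [
    "Aug  6 09:35:34 NODE01 kernel: block drbd0: peer( Secondary -> Unknown ) "
    "conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )",
    "Aug  6 09:37:41 NODE01 kernel: block drbd0: peer( Unknown -> Secondary ) "
    "conn( WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )",
]

if __name__ == "__main__":
    for row in drbd_transitions(SAMPLE):
        print(*row)
```

Run against the full file (for example, passing open("/var/log/messages") instead of SAMPLE), this would list the transitions in order, which makes the step from pdsk UpToDate to DUnknown to Diskless easier to spot among the call traces.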


Radu