<div dir="ltr">Hi drbd developers,<div><br></div><div>After some research and tests I feel I found the reason of this problem and a possible fix in drbd.</div><div>Would you please check if my theory is correct?</div><div><br></div><div><br></div><div>Let me use 8.4.6 as the code base when I explain it.</div><div>When conn_disconnect hang it is hanging at line drbd_receiver.c:5178</div><div>static int drbd_disconnected(struct drbd_peer_device *peer_device)<br></div><div>{</div><div>........</div><div>wait_event(device-&gt;misc_wait, !test_bit(BITMAP_IO, &amp;device-&gt;flags));<br></div><div>}</div><div><br></div><div>The reason is device has flag BITMAP_IO set.</div><div><br></div><div><br></div><div>The reason why flag BITMAP_IO is set and not clear is:</div><div>Disk state changes when network is disconnected and after_state_ch is called.</div><div><br></div><div>At drbd_state.c line 1949 drbd_queue_bitmap_io is called inafter_state_ch() .</div><div><br></div><div>I think the real reason is in drbd_queue_bitmap_io. drbd_main.c line 3641.<br></div><div><div>void drbd_queue_bitmap_io(struct drbd_device *device,</div><div><span class="" style="white-space:pre">                        </span>  int (*io_fn)(struct drbd_device *),</div><div><span class="" style="white-space:pre">                        </span>  void (*done)(struct drbd_device *, int),</div><div><span class="" style="white-space:pre">                        </span>  char *why, enum bm_flag flags)</div><div>{</div><div><span style="white-space:pre">.........</span></div><div><span class="" style="white-space:pre">        </span>set_bit(BITMAP_IO, &amp;device-&gt;flags);</div><div><span class="" style="white-space:pre">        </span>if (atomic_read(&amp;device-&gt;ap_bio_cnt) == 0) {</div><div><span class="" style="white-space:pre">                </span>if (!test_and_set_bit(BITMAP_IO_QUEUED, &amp;device-&gt;flags))</div><div><span class="" style="white-space:pre">                        </span>drbd_queue_work(&amp;first_peer_device(device)-&gt;connection-&gt;sender_work,</div><div><span class="" style="white-space:pre">                                        </span>&amp;device-&gt;bm_io_work.w);</div><div><span class="" style="white-space:pre">        </span>}</div><div><span style="white-space:pre">........</span></div><div>}</div></div><div><br></div><div>In the code the only code to clear BITMAP_IO is in device-&gt;bm_io_work.w(w_bitmap_io). But when atomic_read(&amp;device-&gt;ap_bio_cnt) != 0 the flag BITMAP_IO  is set, however bm_io_work.w is not called.</div><div>Then drbd_disconnected() is blocked.</div><div><br></div><div>Should we move set_bit(BITMAP_IO, &amp;device-&gt;flags) to the front of drbd_queue_work()?</div><div><br></div><div><br></div><div>Thanks a lot!</div><div><br></div><div>Fang</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Apr 9, 2015 at 12:56 PM, Fang Sun <span dir="ltr">&lt;<a href="mailto:sunf2002@gmail.com" target="_blank">sunf2002@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi drbd team,<div><br></div><div>I am running into a drbd problem recently and I hope I can get some help from you.</div><div><br></div><div>This problem can be reproduced in 8.4.4,8.4.5 and 8.4.6.</div><div><br></div><div>I have a 2 nodes cluster. There are 2 disks. One disk is upToDate and the other is syncing.</div><div>I cut the network on standby when one disk is syncing. </div><div>I configured fencing=resource-and-stonith, and I expect my drbd fencing is called when network is shutdown. This always works as expected if both disk are UpToDate when I shutdown network on standby.</div><div>But when one disk is syncing this caused the drbd to suspend both disks and drbd fencing isn&#39;t called. And I can see drbd read process is put into D state.</div><div><br></div><div>Some logs on primary:</div><div><p class="MsoNormal" style="margin-left:0.5in"><span style="color:black">Apr 
6 16:53:49 shrvm219 kernel: drbd cic: PingAck did not arrive in time.</span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:black">Apr 
6 16:53:49 shrvm219 kernel: block drbd1: conn( SyncSource -&gt; NetworkFailure
)</span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:black">Apr 
6 16:53:49 shrvm219 kernel: block drbd2: conn( Connected -&gt; NetworkFailure )
pdsk( UpToDate -&gt; DUnknown )</span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:black">Apr 
6 16:53:49 shrvm219 kernel: drbd cic: peer( Secondary -&gt; Unknown ) susp( 0
-&gt; 1 )</span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:black">Apr 
6 16:53:49 shrvm219 kernel: drbd cic: susp( 0 -&gt; 1 )</span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:black">Apr 
6 16:53:49 shrvm219 kernel: drbd cic: asender terminated</span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:black">Apr 
6 16:53:49 shrvm219 kernel: drbd cic: Terminating drbd_a_cic</span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:red">&gt;&gt;&gt;There
isn’t Connection closed</span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:black">Apr 
6 16:57:28 shrvm220 kernel: INFO: task xfsalloc/0:805 blocked for more than 120
seconds.</span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:black"> </span></p>

<p class="MsoNormal" style="margin-left:0.5in"><span style="color:black">Apr 
3 20:57:36 shrvm220 kernel: INFO: task xfsaild/drbd2:16778 blocked for more
than 120 seconds.</span></p><p class="MsoNormal" style="margin-left:0.5in"><br></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">[root@shrvm219 ~]# ps -ax | grep drbd</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">Warning: bad syntax, perhaps a bogus &#39;-&#39;? See
/usr/share/doc/procps-3.2.8/FAQ</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">5683 ?       
S      0:00 [drbd-reissue]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">11386 pts/0   
S+     0:00 grep drbd</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">12098 ?       
S      0:00 [drbd_submit]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">12103 ?       
S      0:00 [drbd_submit]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">12118 ?       
S      0:02 [drbd_w_cic]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">12139 ?       
D      0:03 [drbd_r_cic]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">12847 ?       
S      0:00 [xfsbufd/drbd2]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">12848 ?      
 S      0:00 [xfs-cil/drbd2]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">12849 ?       
D      0:00 [xfssyncd/drbd2]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">12850 ?       
S      0:02 [xfsaild/drbd2]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">13359 ?       
S      0:00 [xfsbufd/drbd1]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">13360 ?       
S      0:00 [xfs-cil/drbd1]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">13361 ?       
D      0:00 [xfssyncd/drbd1]</span></p><p class="MsoNormal" style="margin-left:0.5in">





























</p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;">13362 ?  
     S      0:02
[xfsaild/drbd1]</span></p><p class="MsoNormal"><span style="font-size:9pt;font-family:&#39;Lucida Console&#39;"><br></span></p><p class="MsoNormal"><span style="color:rgb(31,73,125)">This is part of kernel stack, as
I thought drbd is stuck at conn_disconnect. </span></p><p class="MsoNormal"><span style="color:rgb(31,73,125)"> </span></p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: SysRq : Show Blocked
State</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: 
task                       
PC stack   pid father</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: <span style="color:rgb(192,0,0)">drbd_r_cic    </span>D
0000000000000000     0 12139 
    2 0x00000084</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: ffff88023886fd90
0000000000000046 0000000000000000 ffff88023886fd54</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: ffff88023886fd20
ffff88023fc23040 000002b2f338c2ce ffff8800283158c0</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: 00000000000005ff
000000010028bb73 ffff88023ab87058 ffff88023886ffd8</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: Call Trace:</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8109ee2e&gt;] ? prepare_to_wait+0x4e/0x80</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa0394d4d&gt;] conn_disconnect+0x22d/0x4f0 [drbd]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: [&lt;ffffffff8109eb00&gt;]
? autoremove_wake_function+0x0/0x40</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa0395120&gt;] drbd_receiver+0x110/0x220 [drbd]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa03a69e0&gt;] ? drbd_thread_setup+0x0/0x110 [drbd]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa03a6a0d&gt;] drbd_thread_setup+0x2d/0x110 [drbd]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa03a69e0&gt;] ? drbd_thread_setup+0x0/0x110 [drbd]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8109e66e&gt;] kthread+0x9e/0xc0</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8100c20a&gt;] child_rip+0xa/0x20</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8109e5d0&gt;] ? kthread+0x0/0xc0</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8100c200&gt;] ? child_rip+0x0/0x20</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: <span style="color:rgb(192,0,0)">xfssyncd/drbd </span>D
0000000000000001     0 12849     
2 0x00000080</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: ffff880203387ad0
0000000000000046 0000000000000000 ffff880203387a94</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: ffff880203387a30
ffff88023fc23040 000002b60ef7decd ffff8800283158c0</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: 0000000000000400
000000010028ef9d ffff88023921bab8 ffff880203387fd8</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: Call Trace:</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa039c947&gt;] drbd_make_request+0x197/0x330 [drbd]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8109eb00&gt;] ? autoremove_wake_function+0x0/0x40</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff81270810&gt;] generic_make_request+0x240/0x5a0</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa039cbe9&gt;] ? drbd_merge_bvec+0x109/0x2a0 [drbd]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff81270be0&gt;] submit_bio+0x70/0x120</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff81064b90&gt;] ? default_wake_function+0x0/0x20</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa0208bba&gt;] _xfs_buf_ioapply+0x16a/0x200 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa01ef50a&gt;] ? xlog_bdstrat+0x2a/0x60 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa020a87f&gt;] xfs_buf_iorequest+0x4f/0xe0 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa01ef50a&gt;] xlog_bdstrat+0x2a/0x60 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa01f0ce9&gt;] xlog_sync+0x269/0x3e0 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa01f0f13&gt;] xlog_state_release_iclog+0xb3/0xf0 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa01f13a2&gt;] _xfs_log_force+0x122/0x240 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa01f1688&gt;] xfs_log_force+0x38/0x90 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa0214a02&gt;] xfs_sync_worker+0x52/0xa0 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa021491e&gt;] xfssyncd+0x17e/0x210 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa02147a0&gt;] ? xfssyncd+0x0/0x210 [xfs]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8109e66e&gt;] kthread+0x9e/0xc0</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8100c20a&gt;] child_rip+0xa/0x20</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: [&lt;ffffffff8109e5d0&gt;]
? kthread+0x0/0xc0</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8100c200&gt;] ? child_rip+0x0/0x20</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel<span style="color:rgb(192,0,0)">:
xfssyncd/drbd </span>D 0000000000000001     0
13361      2 0x00000080</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: ffff8802033edad0
0000000000000046 0000000000000000 ffff8802033eda94</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: 0000000000000000
ffff88023fc23040 000002b60ef68ea3 ffff8800283158c0</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: 00000000000007fe
000000010028ef9d ffff880239c41058 ffff8802033edfd8</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel: Call Trace:</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa039c947&gt;] drbd_make_request+0x197/0x330 [drbd]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff8109eb00&gt;] ? autoremove_wake_function+0x0/0x40</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff81270810&gt;] generic_make_request+0x240/0x5a0</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffffa039cbe9&gt;] ? drbd_merge_bvec+0x109/0x2a0 [drbd]</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff81270be0&gt;] submit_bio+0x70/0x120</p><p class="MsoNormal">















































































































</p><p class="MsoNormal">Apr  7 19:34:08 shrvm219 kernel:
[&lt;ffffffff81064b90&gt;] ? default_wake_function+0x0/0x20</p><p class="MsoNormal"><br></p><p class="MsoNormal">Thanks a lot in advance.</p><span class="HOEnZb"><font color="#888888"><p class="MsoNormal"><br></p><p class="MsoNormal">Fang</p></font></span></div></div>
</blockquote></div><br></div>