<div dir="ltr"><div><font face="Courier New, Courier, monospace">Hi List,<br>
<br>
</font></div>
<font face="Courier New, Courier, monospace">I am using DRBD 8.3.6
with Linux kernel 2.6.32. In my environment, the backing device on
the secondary is an iSCSI device (ext3). When I run a test case that
performs synchronous writes on the partition mounted on the primary
(ext3) and the network to the iSCSI host goes down at the same time,
the primary hangs for about 120 seconds.<br>
<br>
</font>
<font face="Courier New, Courier, monospace">Testcase on Primary:</font><br>
<pre><font face="Courier New, Courier, monospace"># "A" is printed before sync and "B" after it, so a hang shows up as a stalled "A".
while true; do
    date | tee -a /mnt/drbd1/c.dat
    echo -n A; sync; echo -n B
    sleep 1; echo C
done</font></pre>
<font face="Courier New, Courier, monospace">Initial analysis
pointed us to the ext3 layer, where we observed the hang in the
following sequence:<br>
<br>
journal_commit_transaction -> wait_for_iobuf ->
wait_on_buffer (<b>gets stuck here</b>)<br>
wait_on_buffer -> buffer locked -> wait_on_bit -> sync_buffer ->
io_schedule<br>
</font><br>
<div><font face="Courier New, Courier, monospace">Debugging
further, we found that ext3 was waiting for a completion callback
from the DRBD driver:<br>
</font><br>
<pre><font face="Courier New, Courier, monospace">submit_bh:
callback for bh = journal_end_buffer_io_sync
</font></pre>
<pre><font face="Courier New, Courier, monospace">callback for bio = end_bio_bh_io_sync (calls journal_end_buffer_io_sync)
submit_bh -> registers end_bio_bh_io_sync as the bio (buffer I/O) callback ->
submit_bio -> generic_make_request -> __generic_make_request ->
q->make_request_fn -> the DRBD handler, drbd_make_request_26</font></pre>
<br>
<font face="Courier New, Courier, monospace">Debugging further in
the DRBD and iSCSI drivers, we found that when the network is down,
the iSCSI layer enters a blocked state for the session recovery
timeout, which defaults to 120 seconds. While iSCSI is blocked, the
secondary never gets from scsi_io_completion to the asender (woken
through wake_asender in drbd_endio_write_sec). The primary, which is
waiting on a wait queue for a P_RECV_ACK from the secondary,
therefore never delivers the callback to the ext3 layer. The
complete call trace is attached for reference.<br>
</font><br>
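<font face="Courier New, Courier, monospace">For reference, with
the open-iscsi initiator this session recovery timeout is normally
set in iscsid.conf. The fragment below is only a sketch and assumes
the secondary uses open-iscsi; 120 is the default value mentioned
above, not a recommendation.<br>
</font>
<pre><font face="Courier New, Courier, monospace"># /etc/iscsi/iscsid.conf (assumes the open-iscsi initiator on the secondary)
# Seconds to wait for the session to be re-established before
# failing queued SCSI commands back up to the upper layers.
node.session.timeo.replacement_timeout = 120
</font></pre>
<font face="Courier New, Courier, monospace">The value in effect for
a running session can be read from
/sys/class/iscsi_session/session*/recovery_tmo.<br>
</font><br>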
<font face="Courier New, Courier, monospace">I back-ported a set
of patches from the 8.3 branch. The major ones are listed below; the
complete list of back-ported patches is in the attached text
file.<br>
</font><br>
<font face="Courier New, Courier, monospace">all patches listed
for "drbd: detach from frozen backing device"</font><br>
<font face="Courier New, Courier, monospace">and<br>
"drbd: Implemented real timeout checking for request processing
time"<br>
</font><font face="Courier New, Courier, monospace"><br>
</font></div>
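<div><font face="Courier New, Courier, monospace">As a point of
comparison, the primary-side limits on how long DRBD waits for the
peer are set in the net section of drbd.conf. The fragment below is
only a sketch: the values are illustrative, and whether they take
effect in this particular hang on 8.3.6 is an assumption, not
something verified here.<br>
</font>
<pre><font face="Courier New, Courier, monospace"># Goes inside a resource section of drbd.conf (illustrative values)
net {
  timeout   60;   # peer response timeout, in tenths of a second (6 s)
  ping-int  10;   # seconds between keep-alive pings on an idle link
  ko-count  4;    # expel the peer if one write request is not completed
                  # within ko-count times the timeout (0 disables this)
}
</font></pre></div>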
<div><font face="Courier New, Courier, monospace">I can still see
the issue with the back-ported patches, so we changed the DRBD
driver so that, if there is no response from the peer, it triggers a
timeout and subsequently a state change. I have attached the patch
for reference. Can anyone please advise whether the attached patch
is the right way to resolve the issue?<br>
</font></div>
<font face="Courier New, Courier, monospace"><br>
Thanks &amp; Regards,<br>
Mukunda</font></div>