[DRBD-user] DRBD 8.3.6: Primary hangs for ~120 seconds

Mon Jun 29 08:09:34 CEST 2015

Hi List,

I am using DRBD 8.3.6 with Linux kernel 2.6.32. In my environment the
backing device on the Secondary is an iSCSI device (ext3). When I run a
test case that does synchronous writes to the DRBD partition (ext3) mounted
on the Primary, and the network on the iSCSI host goes down at the same
time, the Primary hangs for ~120 seconds.

Test case on the Primary:

"
while true; do date | tee -a /mnt/drbd1/c.dat; echo -n A ; sync ; echo -n B ;
sleep 1 ; echo C ; done
"
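
For anyone reproducing this, the loop above can be extended to report how
long each sync actually blocks. This is only a minimal sketch: the function
names (timed_sync, is_stall, watch_sync) and the threshold are illustrative
and not part of the original test, and it assumes a date that supports +%s.

```shell
#!/bin/sh
# Sketch only: measure how long each write+sync cycle blocks.
# Function names and the threshold are illustrative assumptions.

# timed_sync FILE: append a timestamp to FILE, run sync,
# and print the number of whole seconds the sync took.
timed_sync() {
    date >> "$1" || return 1
    start=$(date +%s)
    sync
    end=$(date +%s)
    echo $((end - start))
}

# is_stall SECONDS THRESHOLD: succeed when SECONDS >= THRESHOLD.
is_stall() {
    [ "$1" -ge "$2" ]
}

# watch_sync FILE THRESHOLD COUNT: run COUNT cycles, one per second,
# and report every cycle whose sync blocked for THRESHOLD seconds or more.
watch_sync() {
    i=0
    while [ "$i" -lt "$3" ]; do
        secs=$(timed_sync "$1") || return 1
        if is_stall "$secs" "$2"; then
            echo "$(date): sync blocked for ${secs}s"
        fi
        i=$((i + 1))
        sleep 1
    done
}
```

Running something like "watch_sync /mnt/drbd1/c.dat 5 3600" on the Primary
while the iSCSI host's network is taken down should print one line per
stalled sync, which makes the ~120 second figure easy to confirm.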

Initial analysis pointed us to the ext3 layer, where we observed the hang.
The sequence is:

journal_commit_transaction -> wait_for_iobuf -> wait_on_buffer <gets stuck
here>

wait_on_buffer -> buffer locked -> wait_on_bit -> sync_buffer ->
io_schedule

Debugging further, we found that ext3 was waiting for a completion callback
from the DRBD driver:

submit_bh:

callback for bh = journal_end_buffer_io_sync

callback for bio = end_bio_bh_io_sync ( calls journal_end_buffer_io_sync )

submit_bh -> register callback for bio (buffer io) end_bio_bh_io_sync ->
submit_bio -> generic_make_request -> __generic_make_request ->
q->make_request_fn, which for DRBD is drbd_make_request_26.

Debugging further in the DRBD and iSCSI drivers, we found that when the
network is down the iSCSI layer enters a blocked state for the duration of
the session recovery timeout, which defaults to 120 seconds. While iSCSI is
blocked, the path on the Secondary from <scsi_io_completion> to <asender
through wake_asender in drbd_endio_write_sec> does not run, so the Primary,
which waits on a wait queue for a P_RECV_ACK from the Secondary, does not
deliver the callback to the ext3 layer. The complete call trace is attached
for reference.
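
To check which recovery timeout is actually in effect on the Secondary, a
sketch like the following may help. It assumes open-iscsi exposes the value
as recovery_tmo under /sys/class/iscsi_session, which should be verified on
the kernel in use; the function name print_recovery_tmo and its optional
directory argument are illustrative.

```shell
#!/bin/sh
# Sketch only: list the recovery timeout of every iSCSI session.
# Assumes the open-iscsi sysfs layout; pass another directory to override.

# print_recovery_tmo [SYSFS_DIR]: print "<session> <recovery_tmo>" per session.
print_recovery_tmo() {
    base="${1:-/sys/class/iscsi_session}"
    for f in "$base"/session*/recovery_tmo; do
        # Skip the unexpanded glob when no session exists.
        [ -r "$f" ] || continue
        session=$(basename "$(dirname "$f")")
        echo "$session $(cat "$f")"
    done
}
```

Calling print_recovery_tmo with no argument on the Secondary should show a
value of 120 for each session if the default is in use, matching the length
of the hang seen on the Primary.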

I back-ported a set of patches from the 8.3 branch. The major ones are
listed below; the complete list is in the attached text file.

all patches listed for drbd: detach from frozen backing device
&
drbd: Implemented real timeout checking for request processing time

I can still see the issue with the back-ported patches, so we changed the
DRBD driver so that, when there is no response from the peer, it triggers a
timeout and subsequently a state change. I have attached the patch for
reference. Can anyone please advise whether the attached patch is the right
way to resolve the issue?

Thanks & Regards,
Mukunda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Trigger_timeout.patch
Type: text/x-patch
Size: 3015 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20150629/abb6bfa2/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: call-trace
Type: application/octet-stream
Size: 4672 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20150629/abb6bfa2/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Backported Patches for my_env
Type: application/octet-stream
Size: 5161 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20150629/abb6bfa2/attachment-0001.obj>