Hi List,

I am using DRBD 8.3.6 with Linux kernel 2.6.32. In my environment I use an iSCSI device (ext3) on my secondary as the backup device. When I run a test case that does synchronous writes on the primary's mounted partition (ext3), and the network goes down on the iSCSI host at the same time, the primary hangs for about 120 seconds.

Test case on the primary:

    while true; do date | tee -a /mnt/drbd1/c.dat; echo -n A; sync; echo -n B; sleep 1; echo C; done

Initial analysis pointed us to the ext3 layer, where we observed the hang. The sequence is:

    journal_commit_transaction -> wait_for_iobuf -> wait_on_buffer   < *gets stuck here* >
    wait_on_buffer -> buffer locked -> wait_on_bit -> sync_buffer -> io_schedule

Debugging further, we found that we were waiting for a callback from the drbd driver:

    submit_bh:
      callback for bh  = journal_end_buffer_io_sync
      callback for bio = end_bio_bh_io_sync (calls journal_end_buffer_io_sync)

    submit_bh -> register callback for bio (buffer io) end_bio_bh_io_sync
              -> submit_bio -> generic_make_request -> __generic_make_request
              -> q->make_request_fn

The corresponding handler for drbd is drbd_make_request_26.

Debugging further in the drbd driver and the iSCSI driver, we found that when the network is down, the iSCSI layer goes into a blocked state for a time equal to the session recovery timeout, which defaults to 120 seconds.

On the secondary, the operations from scsi_io_completion through to the asender (via wake_asender in drbd_endio_write_sec) do not happen while iSCSI is in the blocked state. As a result, the callback to the ext3 layer does not happen on the primary, which waits on a wait queue to receive a P_RECV_ACK from the secondary. The complete call trace is attached for reference.

I back-ported a set of patches from the 8.3 branch. The major ones are listed below; the complete list is in the attached text file of back-ported patches.

    all patches listed for
      "drbd: detach from frozen backing device" &
      "drbd: Implemented real timeout checking for request processing time"

I can still see the issue with the back-ported patches, so we made some changes to the drbd driver: if there is no response from the peer, we try to trigger a timeout and subsequently a state change. I have attached the patch for reference.

Can anyone please suggest whether the attached patch is the right way to resolve the issue?

Thanks & Regards,
Mukunda

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20150629/abb6bfa2/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Trigger_timeout.patch
Type: text/x-patch
Size: 3015 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20150629/abb6bfa2/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: call-trace
Type: application/octet-stream
Size: 4672 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20150629/abb6bfa2/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Backported Patches for my_env
Type: application/octet-stream
Size: 5161 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20150629/abb6bfa2/attachment-0001.obj>
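
For anyone trying to reproduce this, the test case in the post can be extended to record how long each sync blocks, so that a stall of roughly 120 seconds stands out in the output. What follows is only a minimal sketch, not part of the original report: the function names time_cmd and stall_loop and the 5-second default threshold are made up for illustration, and the /mnt/drbd1/c.dat default is simply the path used in the original loop.

```shell
#!/bin/bash
# Sketch only: time_cmd, stall_loop and the threshold are hypothetical
# helpers, not part of DRBD or of the original test case.

# time_cmd THRESHOLD CMD [ARGS...]
# Runs CMD, measures elapsed wall-clock time in whole seconds, and prints
# "STALL <n>s: CMD" if it took at least THRESHOLD seconds, else "ok <n>s: CMD".
time_cmd() {
    local threshold=$1; shift
    local start end elapsed
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -ge "$threshold" ]; then
        echo "STALL ${elapsed}s: $*"
    else
        echo "ok ${elapsed}s: $*"
    fi
}

# stall_loop [FILE] [THRESHOLD]
# Same write/sync cycle as the original test case, with each sync timed.
# Runs until interrupted.
stall_loop() {
    local file=${1:-/mnt/drbd1/c.dat}
    local threshold=${2:-5}
    while true; do
        date | tee -a "$file"
        time_cmd "$threshold" sync
        sleep 1
    done
}
```

After sourcing the file, it would be started on the primary with something like `stall_loop /mnt/drbd1/c.dat 5`. Timing only the sync keeps the measurement on the step that the ext3 trace shows blocking; resolution is one second, which is coarse but enough for a stall of this size.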