[Drbd-dev] I/O can hang on primary synctarget after an io error.

Montrose, Ernest Ernest.Montrose at stratus.com
Mon Feb 25 21:31:01 CET 2008


Hi all,
We are seeing an issue where I/O hangs on a volume that received an I/O
error while it was the sync target of a resync. Looking at the logs, it
seems that we are skipping a dec_local().  My theory is that
after_state_ch() is blocked forever waiting for local_cnt to reach 0 as
we become Diskless.  So the worker never does any more work, hence the
hung I/O.  Here are the relevant logs:
 
Feb 13 03:48:55 node0 kernel: drbd5: Began resync as SyncTarget (will sync 1048508 KB [262127 bits set]).
Feb 13 03:48:55 node0 kernel: drbd5: Writing meta data super block now.
Feb 13 03:48:55 node0 kernel: drbd5: Creating new epoch in drbd_try_rs_begin_io
Feb 13 03:48:55 node0 kernel: drbd5: ***Simulating Resync write failure
Feb 13 03:48:56 node0 kernel: drbd5: Resync aborted.
Feb 13 03:48:56 node0 kernel: drbd5: conn( SyncTarget -> Connected ) disk( Inconsistent -> Failed )
Feb 13 03:48:56 node0 kernel: drbd5: Local IO failed. Detaching...
Feb 13 03:48:56 node0 kernel: drbd5: disk( Failed -> Diskless )
Feb 13 03:48:56 node0 kernel: drbd5: Notified peer that my disk is broken.
Feb 13 03:48:56 node0 kernel: drbd5: Can not write resync data to local disk.
Feb 13 03:54:57 node0 kernel: drbd5: drbd_nl_disk_conf: mdev->bc not NULL.
 
Notice the last line of the log.  Our test environment must have tried
an "attach" at that point, and because local_cnt never reached 0 the
"bc" had never been freed.
 
But from the "Can not write resync data to local disk." message we can
get to drbd_endio_write_sec(), and there we see a suspicious:
if (bio->bi_size) return 1;
We are supposed to do the dec_local() at the end of
drbd_endio_write_sec(), so I am guessing that's where the problem is.
But I do not know why bi_size would be greater than 0.  Is the fix
simply to call dec_local() before returning?
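
To make the suspected sequence concrete, here is a minimal userspace
sketch in C.  It is not DRBD code: every toy_* name and structure is
made up for illustration, and it assumes the completion handler is
called exactly once, with bi_size still non-zero, which has not been
confirmed.  It only shows how an early return that skips the reference
drop would leave local_cnt above 0, so "bc" is never freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins (hypothetical); not the real DRBD structures. */
struct toy_dev {
    int  local_cnt;  /* references held on the backing device */
    bool has_bc;     /* models "mdev->bc != NULL" */
};

struct toy_bio {
    unsigned int bi_size;  /* bytes still outstanding on this bio */
};

static void toy_inc_local(struct toy_dev *d) { d->local_cnt++; }
static void toy_dec_local(struct toy_dev *d) { d->local_cnt--; }

/* Completion handler shaped like the pattern in question. */
static int toy_endio_write_sec(struct toy_dev *d, struct toy_bio *bio)
{
    if (bio->bi_size)
        return 1;       /* early exit: the dec_local below is skipped */

    toy_dec_local(d);   /* normal path: drop the reference */
    return 0;
}

/* Models the detach path: bc is released only once local_cnt is 0.
 * The real code would sleep instead of returning false. */
static bool toy_try_free_bc(struct toy_dev *d)
{
    if (d->local_cnt != 0)
        return false;
    d->has_bc = false;
    return true;
}

/* Returns true if the detach could complete after one completion call. */
static bool toy_scenario(unsigned int bytes_left_at_completion)
{
    struct toy_dev d   = { .local_cnt = 0, .has_bc = true };
    struct toy_bio bio = { .bi_size = bytes_left_at_completion };

    toy_inc_local(&d);              /* taken when the write is submitted */
    toy_endio_write_sec(&d, &bio);  /* the only completion call we get */
    return toy_try_free_bc(&d);
}
```

In this model toy_scenario(0) lets the detach finish, while
toy_scenario(4096) leaves local_cnt at 1 and "bc" in place.  As far as I
understand, on kernels of this era the handler can be called for partial
completions and the bi_size check is the usual way to wait for the final
call, so dropping the reference on the early return could unbalance the
count if a later call does arrive.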
 
BTW, the script below inserts the fault in question but won't
necessarily make the hang happen:
 
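#!/bin/sh
echo inserting fault....
# 0x4 = bit 2: fail resync writes (see "Simulating Resync write failure" above)
echo 0x4 > /sys/module/drbd/parameters/enable_faults
# 0x20 = bit 5: only inject on minor 5 (drbd5)
echo 0x20 > /sys/module/drbd/parameters/fault_devs
# fail roughly 5% of the matching requests
echo 5 > /sys/module/drbd/parameters/fault_rate
echo starting resync
sleep 5
drbdadm -c /etc/drbd.conf.avance invalidate drbd5.vol
echo done....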
 
Thanks,
 
EM--