[Drbd-dev] I/O can hang on primary synctarget after an io error.
Ernest.Montrose at stratus.com
Mon Feb 25 22:53:35 CET 2008
We appear to have a009fc907a14f69026b32fbb48a4db6f1cdd5ecd. Reading
your response, what I understand is that if we return early in
drbd_endio_write_sec(), we are guaranteed that either someone else
will have done the dec_local near the end, or an inc_local was never done?
Hmmm... our testing was not done with the latest git stuff. I will
run some tests with the latest.
From: drbd-dev-bounces at linbit.com [mailto:drbd-dev-bounces at linbit.com]
On Behalf Of Lars Ellenberg
Sent: Monday, February 25, 2008 4:07 PM
To: drbd-dev at linbit.com
Subject: Re: [Drbd-dev] I/O can hang on primary synctarget after an io
On Mon, Feb 25, 2008 at 03:31:01PM -0500, Montrose, Ernest wrote:
> Hi all,
> We are seeing an issue where I/O to a volume that received an I/O
> error during re-sync as the sync target hangs. Looking at the logs,
> it seems that what's going on is that we are skipping a dec_local().
> The theory is that after_state_ch() is blocked forever waiting for
> local_cnt to be 0 as we are becoming Diskless. So the worker will not
> do any work, hence the hung I/O. Here are the relevant logs:
> Feb 13 03:48:55 node0 kernel: drbd5: Began resync as SyncTarget
> sync 1048508 KB [262127 bits set]).
> Feb 13 03:48:55 node0 kernel: drbd5: Writing meta data super block
> Feb 13 03:48:55 node0 kernel: drbd5: Creating new epoch in
> Feb 13 03:48:55 node0 kernel: drbd5: ***Simulating Resync write
> Feb 13 03:48:56 node0 kernel: drbd5: Resync aborted.
> Feb 13 03:48:56 node0 kernel: drbd5: conn( SyncTarget -> Connected
> disk( Inconsistent -> Failed )
> Feb 13 03:48:56 node0 kernel: drbd5: Local IO failed. Detaching...
> Feb 13 03:48:56 node0 kernel: drbd5: disk( Failed -> Diskless )
> Feb 13 03:48:56 node0 kernel: drbd5: Notified peer that my disk is
> Feb 13 03:48:56 node0 kernel: drbd5: Can not write resync data to
> local disk.
> Feb 13 03:54:57 node0 kernel: drbd5: drbd_nl_disk_conf: mdev->bc
> Notice the last line of the log. Our test environment must have
> to do an "attach" so since local_cnt is not 0 we never freed the
> But from the "Can not write resync data to local disk." we can go to
> drbd_endio_write_sec() and there we see a suspicious:
> if (bio->bi_size) return 1;
it's not suspicious. it's "standard procedure".
it even got removed from the internal kernel API recently.
> We are supposed to do the dec_local at the end of
> drbd_endio_write_sec(). I am guessing that's where the problem is.
> But I do not see why bi_size would be greater than 0. Is the fix
> simply to do the dec_local while returning?
IF there is an imbalance in the local refcounting, then it is elsewhere.
drbd_endio_write_sec is correct, afaics.
do you have this a009fc907a14f69026b32fbb48a4db6f1cdd5ecd
commit included in your code base?
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :