[Drbd-dev] Barrier assert failures with latest 8.0 sources
Lars Ellenberg
lars.ellenberg at linbit.com
Mon Jan 21 17:36:38 CET 2008
On Sat, Jan 19, 2008 at 11:40:35AM -0500, Graham, Simon wrote:
> > I'm attempting to run with the latest 8.0 sources from Git (plus a
> > couple of patches - basically the ones I have submitted that have not
> > yet been applied) and am seeing a lot of assert failures in the barrier
> > code since the latest change to send barriers as early as possible. A
> > representative trace for a device is attached - you will see that the
> > device gets connected then pauses resync (not sure if this is really
> > relevant) and then we start streaming the assert failures -- apparently
> > we are off by one barrier from this point on...
>
> Hmm.. maybe not as hard to diagnose as I thought -- when the drbd
> connection is lost, we end up calling tl_clear which clears out the
> transfer list _but_ leaves a single barrier in the list with number 4711
> and req-cnt 0 (so oldest_barrier and newest_barrier both point to this
> pseudo-barrier entry).
>
> When we reconnect and start processing requests again, when the first
> barrier is needed, it will be number 4712 and will get added to the list
> and the BarrierRq will be sent with this number. When the BarrierAck is
> received, oldest_barrier is still 4711 though, leading to the assert
> failure...
>
> I'm not sure why tl_clear leaves this pseudo-barrier in the list...
> shouldn't it simply leave the list completely empty just like tl_init
> does?
probably.
we have seen these ASSERTS, too, btw, also without this latest change in
the barrier code, so apparently it has been there all along.
unfortunately we are all sort of distracted right now.
but coding will resume shortly :)
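
for illustration, the off-by-one described in the quoted analysis can be
reproduced with a toy model. this is only a sketch, not the DRBD code: the
class ToyTransferLog, its methods and the simulate() helper are all
hypothetical names, and it models nothing beyond the sequence described
above (a clear that leaves one empty barrier numbered 4711 behind, new
barriers numbered upward from there, and each ack compared against the
oldest entry in the list).

```python
from collections import deque

PSEUDO_BARRIER_NR = 4711  # barrier number quoted in the report above


class ToyTransferLog:
    """Toy model of a barrier list. Hypothetical; not DRBD's data structure."""

    def __init__(self):
        self.barriers = deque()          # oldest barrier is at the left
        self.next_nr = PSEUDO_BARRIER_NR
        self.assert_failures = 0

    def clear(self, keep_pseudo_barrier):
        """Model what happens to the list when the connection is lost."""
        self.barriers.clear()
        self.next_nr = PSEUDO_BARRIER_NR
        if keep_pseudo_barrier:
            # Behaviour described in the report: one barrier with no
            # requests stays in the list after the clear.
            self.barriers.append(self.next_nr)
            self.next_nr += 1

    def send_barrier(self):
        """Append a new barrier and return the number sent to the peer."""
        nr = self.next_nr
        self.next_nr += 1
        self.barriers.append(nr)
        return nr

    def receive_ack(self, acked_nr):
        """Retire the oldest barrier; count a failure if numbers differ."""
        oldest = self.barriers.popleft()
        if oldest != acked_nr:
            self.assert_failures += 1


def simulate(keep_pseudo_barrier, n_barriers=3):
    """Clear, reconnect, then send and ack n_barriers barriers in order."""
    tl = ToyTransferLog()
    tl.clear(keep_pseudo_barrier)
    for _ in range(n_barriers):
        tl.receive_ack(tl.send_barrier())
    return tl.assert_failures


if __name__ == "__main__":
    print("pseudo-barrier kept:", simulate(True), "mismatches")
    print("list left empty:   ", simulate(False), "mismatches")
```

in this model every ack after the reconnect is compared against a barrier
that is one number behind, so the mismatch repeats for each barrier rather
than happening once. leaving the list empty after the clear makes the
numbers line up again. whether that is the right fix for the real code is
a separate question.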
--
: Lars Ellenberg Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :