[Drbd-dev] Barrier assert failures with latest 8.0 sources

Graham, Simon Simon.Graham at stratus.com
Sat Jan 19 17:40:35 CET 2008


> I'm attempting to run with the latest 8.0 sources from Git (plus a
> couple of patches - basically the ones I have submitted that have not
> yet been applied) and am seeing a lot of assert failures in the barrier
> code since the latest change to send barriers as early as possible. A
> representative trace for a device is attached - you will see that the
> device gets connected then pauses resync (not sure if this is really
> relevant) and then we start streaming the assert failures -- apparently
> we are off by one barrier from this point on...

Hmm.. maybe not as hard to diagnose as I thought -- when the drbd
connection is lost, we end up calling tl_clear which clears out the
transfer list _but_ leaves a single barrier in the list with number 4711
and req-cnt 0 (so oldest_barrier and newest_barrier both point to this
pseudo-barrier entry). 

When we reconnect and start processing requests again, the first
barrier that is needed will be number 4712; it gets added to the list
and the BarrierRq is sent with this number. When the BarrierAck is
received, oldest_barrier is still 4711 though, leading to the assert
failure...
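
To make the sequence above concrete, here is a small stand-alone toy
model of it. This is only a sketch under my own assumptions and is NOT
the DRBD code: the struct layout and the *_model function names are
made up for illustration, and the only things taken from the
description are the pseudo-barrier number 4711, the req-cnt of 0, and
the comparison of the acked number against oldest_barrier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real structures. */
struct barrier {
    struct barrier *next;
    unsigned int br_number;
    int n_req;
};

struct tl {
    struct barrier *oldest;   /* plays the role of oldest_barrier */
    struct barrier *newest;   /* plays the role of newest_barrier */
};

static struct barrier *barrier_new(unsigned int nr)
{
    struct barrier *b = malloc(sizeof(*b));
    assert(b != NULL);
    b->next = NULL;
    b->br_number = nr;
    b->n_req = 0;
    return b;
}

/* Model of tl_clear as described: drop every entry, then leave one
 * pseudo-barrier with number 4711 and no requests. */
static void tl_clear_model(struct tl *tl)
{
    while (tl->oldest != NULL) {
        struct barrier *b = tl->oldest;
        tl->oldest = b->next;
        free(b);
    }
    tl->oldest = tl->newest = barrier_new(4711);
}

/* Model of adding a barrier: the new entry is numbered newest + 1 and
 * that number is what would be sent in the BarrierRq. */
static unsigned int tl_add_barrier_model(struct tl *tl)
{
    struct barrier *b = barrier_new(tl->newest->br_number + 1);
    tl->newest->next = b;
    tl->newest = b;
    return b->br_number;
}

/* Model of BarrierAck handling: compare the acked number with the
 * oldest entry (the check that asserts), then advance past it.
 * Returns true if the numbers matched. */
static bool tl_release_model(struct tl *tl, unsigned int acked_nr)
{
    struct barrier *b = tl->oldest;
    bool matched = (b->br_number == acked_nr);
    if (b->next != NULL) {
        tl->oldest = b->next;
        free(b);
    }
    return matched;
}
```

In this model, calling tl_clear_model, then tl_add_barrier_model, then
tl_release_model with the number just sent compares 4712 against 4711.
Every later add/release pair stays one behind in the same way, which
is at least consistent with the stream of assert failures in the
trace.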

I'm not sure why tl_clear leaves this pseudo-barrier in the list...
shouldn't it simply leave the list completely empty just like tl_init
does?

Simon

