[Drbd-dev] Barrier assert failures with latest 8.0 sources

Graham, Simon Simon.Graham at stratus.com
Tue Jan 22 03:29:02 CET 2008


> > I'm not sure why tl_clear leaves this pseudo-barrier in the list...
> > shouldn't it simply leave the list completely empty just like tl_init
> > does?
> 
> probably.
> we have seen these ASSERTS, too, btw, also without this latest change in
> the barrier code, so apparently it has been there all along.
> unfortunately we are all sort of distracted right now.
> but coding will resume shortly :)

Well, I realize now that I completely misunderstood again;
newest_barrier represents the next barrier that will be sent, so of
course there has to be one in the list at all times (and tl_init also
sets up barrier 4711).

I think the problem is that tl_clear does NOT clear the CREATE_BARRIER
bit in mdev->flags. If we disconnect in the small window between
setting this bit and creating the new barrier, then when we reconnect
and process the first request that has to go remote, we end up creating
a new barrier before sending BarrierRq(4711). I think this gets us into
a cycle of always being one barrier behind the remote system. It would
also explain why the assert is intermittent, since you have to
disconnect in that small window.
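
To make the suspected sequence concrete, here is a toy model in C. It is
only a sketch of my reading of the problem, not the DRBD source: struct
toy_dev and every toy_* function are hypothetical stand-ins, and the
assumption that a new barrier simply takes the next number is mine. It
shows that when the flag survives the disconnect, the first request after
reconnect lands in a new epoch instead of the 4711 one.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FIRST_BARRIER 4711u  /* initial barrier number named in this thread */

/* Hypothetical stand-in for the relevant per-device state. */
struct toy_dev {
    bool     create_barrier;  /* models the CREATE_BARRIER bit in mdev->flags */
    unsigned newest_barrier;  /* number of the next barrier that will be sent */
};

/* Models tl_init: one pseudo-barrier in the list, no pending flag. */
static void toy_tl_init(struct toy_dev *d)
{
    d->create_barrier = false;
    d->newest_barrier = TOY_FIRST_BARRIER;
}

/* Models tl_clear on disconnect. With clear_flag == false the flag
 * survives (the suspected bug); clear_flag == true is the proposed fix. */
static void toy_tl_clear(struct toy_dev *d, bool clear_flag)
{
    d->newest_barrier = TOY_FIRST_BARRIER;
    if (clear_flag)
        d->create_barrier = false;
}

/* Models queuing one request that has to go remote. Returns the number
 * of the barrier whose epoch the request joins. */
static unsigned toy_queue_request(struct toy_dev *d)
{
    assert(d);
    if (d->create_barrier) {
        d->create_barrier = false;
        d->newest_barrier++;  /* assumption: new barriers just count up */
    }
    return d->newest_barrier;
}

/* Disconnect inside the window, reconnect, and report which epoch the
 * first request after reconnect ends up in. */
static unsigned toy_first_epoch_after_reconnect(bool clear_flag)
{
    struct toy_dev d;

    toy_tl_init(&d);
    d.create_barrier = true;       /* flag set, new barrier not yet created */
    toy_tl_clear(&d, clear_flag);  /* disconnect hits the small window */
    return toy_queue_request(&d);  /* first request after reconnect */
}
```

In this model, toy_first_epoch_after_reconnect(false) returns 4712 (the
stale flag forces a new barrier before anything for 4711 went out), while
toy_first_epoch_after_reconnect(true) returns 4711.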

Seem reasonable?
/simgr

