[Drbd-dev] Barrier assert failures with latest 8.0 sources

Graham, Simon Simon.Graham at stratus.com
Wed Jan 23 14:53:26 CET 2008


> I do feel that the situations when this can arise should be incredibly
> rare - the connection has to fail between issuing the first request and
> the time it completes... and yet I can hit this problem fairly easily,
> so maybe there is some other situation I haven't thought of that causes
> this path to be taken.

So, I was right to be concerned -- the _actual_ cause of the problem is
as follows:

1. tl_clear first runs through the TL and does a req_mod(connection_lost).
   This _can_ cause requests to be completed and therefore can end up
   calling queue_barrier because this is the first request in the current
   barrier. This means that CREATE_BARRIER can be set in the flags.

2. Then tl_clear reinitializes the transfer list but does not clear the
   flag.

Thus pretty much anytime you lose the connection whilst there is traffic
in progress you run the risk of setting the flag and thereby setting
yourself up to create a new barrier incorrectly next time you connect.
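
To make the sequence above concrete, here is a toy model of it in plain C.
It is only a sketch: the toy_* names, the struct layout and the "any
in-flight request triggers queue_barrier" shortcut are assumptions made for
illustration and are not taken from the DRBD sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these definitions come from the DRBD tree. */
enum toy_state { TOY_STANDALONE, TOY_WF_CONNECTION, TOY_CONNECTED };

struct toy_dev {
    enum toy_state cstate;
    bool create_barrier;  /* stands in for the CREATE_BARRIER flag bit */
    int  tl_requests;     /* requests still in the current TL epoch    */
};

/* Completing the first request of an epoch asks for a new barrier. */
static void toy_queue_barrier(struct toy_dev *d)
{
    d->create_barrier = true;
}

/* Stands in for req_mod(connection_lost): in-flight requests get
 * completed, which (simplified here) ends up in queue_barrier. */
static void toy_req_mod_connection_lost(struct toy_dev *d)
{
    if (d->tl_requests > 0)
        toy_queue_barrier(d);
}

/* Stands in for tl_clear as described in steps 1 and 2. */
static void toy_tl_clear(struct toy_dev *d)
{
    assert(d->cstate < TOY_CONNECTED);  /* only run after connection loss */
    toy_req_mod_connection_lost(d);     /* step 1: may set the flag       */
    d->tl_requests = 0;                 /* step 2: TL reset, flag kept    */
}

/* Lose the connection with traffic in flight; report the leftover flag. */
static bool toy_flag_after_connection_loss(void)
{
    struct toy_dev d = { TOY_CONNECTED, false, 3 };
    d.cstate = TOY_WF_CONNECTION;
    toy_tl_clear(&d);
    return d.create_barrier;
}
```

In this model toy_flag_after_connection_loss() returns true: the transfer
list is empty again, but the flag survives into the next connection.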

I think the fixes are:

a. Don't call queue_barrier if the connection state is < Connected
b. Make sure the flag is clear in drbd_connect
c. Modify w_send_barrier to decrement the ap count if it decides to not
   send a barrier (e.g. if the connection is lost between the time the
   work item is queued and the time it runs).
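
Roughly, the three changes could look like the following, again as a
standalone toy model rather than a patch. The fx_* names and fields are
hypothetical, and the model assumes that queueing a barrier takes one ap
count which normally gets released once the barrier has been acknowledged.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these definitions come from the DRBD tree. */
enum fx_state { FX_STANDALONE, FX_WF_CONNECTION, FX_CONNECTED };

struct fx_dev {
    enum fx_state cstate;
    bool create_barrier;  /* stands in for the CREATE_BARRIER flag bit */
    int  ap_pending_cnt;  /* stands in for the ap count                */
};

/* (a) refuse to queue a barrier while not connected */
static void fx_queue_barrier(struct fx_dev *d)
{
    if (d->cstate < FX_CONNECTED)
        return;
    if (d->create_barrier)
        return;              /* a barrier is already pending */
    d->ap_pending_cnt++;     /* assumed: queued barrier holds one count */
    d->create_barrier = true;
}

/* (b) start every new connection with the flag cleared */
static void fx_connect(struct fx_dev *d)
{
    d->create_barrier = false;
    d->cstate = FX_CONNECTED;
}

/* (c) if the worker decides not to send, give the ap count back.
 * Returns true when the barrier would have been sent. */
static bool fx_w_send_barrier(struct fx_dev *d)
{
    if (d->cstate < FX_CONNECTED) {
        assert(d->ap_pending_cnt > 0);
        d->ap_pending_cnt--;
        return false;
    }
    return true;  /* count is released later, on the (modelled) ack */
}

/* Connection already lost when a request completes: (a) applies. */
static bool fx_flag_when_queued_while_disconnected(void)
{
    struct fx_dev d = { FX_WF_CONNECTION, false, 0 };
    fx_queue_barrier(&d);
    return d.create_barrier;
}

/* Barrier queued while connected, connection lost before the work
 * item runs, then reconnect: (c) and (b) apply. */
static struct fx_dev fx_loss_between_queue_and_run(void)
{
    struct fx_dev d = { FX_CONNECTED, false, 0 };
    fx_queue_barrier(&d);          /* count 0 -> 1, flag set       */
    d.cstate = FX_WF_CONNECTION;   /* connection drops             */
    (void)fx_w_send_barrier(&d);   /* not sent, count 1 -> 0       */
    fx_connect(&d);                /* flag cleared for new session */
    return d;
}
```

With all three in place the model comes back from a reconnect with the flag
clear and the ap count balanced, whichever side of the queue/run window the
connection loss falls on.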

Will submit patch once I've tested this...
Simon

