[Drbd-dev] Barrier assert failures with latest 8.0 sources
Graham, Simon
Simon.Graham at stratus.com
Wed Jan 23 14:53:26 CET 2008
> I do feel that the situations when this can arise should be incredibly
> rare - the connection has to fail between issuing the first request and
> the time it completes... and yet I can hit this problem fairly easily,
> so maybe there is some other situation I haven't thought of that causes
> this path to be taken.
>
So, I was right to be concerned -- the _actual_ cause of the problem is
as follows:
1. tl_clear first runs through the TL and does a req_mod(connection_lost).
   This _can_ cause requests to be completed and therefore can end up
   calling queue_barrier because this is the first request in the current
   barrier. This means that CREATE_BARRIER can be set in the flags.
2. Then tl_clear reinitializes the transfer list but does not clear the
   flag.
Thus pretty much anytime you lose the connection whilst there is traffic
in progress you run the risk of setting the flag and thereby setting
yourself up to create a new barrier incorrectly next time you connect.
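
As a rough illustration of the sequence above, here is a toy model in C. It is only a sketch: the struct, its fields, and every toy_* function are hypothetical stand-ins, not the real DRBD code. The only things taken from the description are the ordering (complete requests with connection_lost, then reinitialize the list) and the fact that the flag is never cleared.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these types and fields are hypothetical, not DRBD's. */
enum toy_state { TOY_STANDALONE, TOY_CONNECTED };

struct toy_device {
    enum toy_state cstate;
    bool create_barrier;  /* stands in for the CREATE_BARRIER flag */
    int  pending;         /* requests still on the transfer list */
    int  epoch_size;      /* requests completed in the current barrier */
};

/* Completing the first request of a barrier asks for a new barrier.
 * Setting the flag here models the call to queue_barrier. */
static void toy_complete_request(struct toy_device *d)
{
    assert(d->pending > 0);
    if (d->epoch_size == 0)
        d->create_barrier = true;
    d->epoch_size++;
    d->pending--;
}

/* Models tl_clear as described: step 1 completes everything with
 * connection_lost, step 2 reinitializes the list, and nothing
 * touches the flag. */
static void toy_tl_clear(struct toy_device *d)
{
    while (d->pending > 0)
        toy_complete_request(d);  /* models req_mod(connection_lost) */
    d->epoch_size = 0;            /* transfer list reinitialized */
    /* create_barrier is deliberately left as it is */
}

/* Returns true if the flag is still set after losing the connection
 * with in_flight requests outstanding. */
static bool toy_flag_stale_after_disconnect(int in_flight)
{
    struct toy_device d = { TOY_CONNECTED, false, in_flight, 0 };

    d.cstate = TOY_STANDALONE;    /* connection lost */
    toy_tl_clear(&d);
    return d.create_barrier;
}
```

In this model, calling toy_flag_stale_after_disconnect() with any nonzero number of in-flight requests leaves the flag set, while an idle disconnect does not, which matches the observation that the problem needs traffic in progress.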
I think the fixes are:
a. Don't do queue_barrier if the connection state is < Connected
b. Make sure the flag is clear in drbd_connect
c. Modify w_send_barrier to decrement the ap count if it decides to not
   send a barrier (e.g. if the connection is lost between the time the
   work item is queued and the time it runs).
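
The three fixes could look roughly like this in a toy model. Again this is a sketch under assumptions, not the actual patch: the struct and the toy_* functions are hypothetical, and the model assumes that queuing the barrier work item is what takes the ap count reference that fix (c) has to give back.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical types, not DRBD's real structures. */
enum toy_cstate { TOY_CS_STANDALONE, TOY_CS_CONNECTED };

struct toy_dev {
    enum toy_cstate cstate;
    bool create_barrier;  /* stands in for the CREATE_BARRIER flag */
    int  ap_pending;      /* stands in for the ap count */
};

/* Fix (a): refuse to queue a barrier below Connected.
 * Returns true if a barrier work item was queued. */
static bool toy_queue_barrier(struct toy_dev *d)
{
    if (d->cstate < TOY_CS_CONNECTED || d->create_barrier)
        return false;
    d->create_barrier = true;
    d->ap_pending++;      /* assumed: queuing takes a pending reference */
    return true;
}

/* Fix (b): every new connection starts with the flag cleared. */
static void toy_connect(struct toy_dev *d)
{
    d->create_barrier = false;
    d->cstate = TOY_CS_CONNECTED;
}

/* Fix (c): if the work item runs after the connection was lost, drop
 * the reference instead of leaking it.
 * Returns true if the barrier was sent. */
static bool toy_w_send_barrier(struct toy_dev *d)
{
    if (d->cstate < TOY_CS_CONNECTED) {
        assert(d->ap_pending > 0);
        d->ap_pending--;
        return false;
    }
    d->create_barrier = false;  /* sent; the peer's ack is not modeled */
    return true;
}
```

Fix (a) alone closes the path through tl_clear, but (b) and (c) still matter for the window where the work item was queued while connected and only runs after the connection has gone away.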
Will submit patch once I've tested this...
Simon