[DRBD-user] [BUG] drbd 0.7.4 reconnect problem after network failure

Steve Purkis steve.purkis at multimap.com
Tue Sep 21 15:19:14 CEST 2004

On Sep 21, 2004, at 12:31, Lars Ellenberg wrote:

> / 2004-09-21 08:48:51 +0100
> \ Steve Purkis:
>> Hi all,
>> It seems DRBD 0.7.4 cannot recover from a network failure.
> nonsense.
> see below.

[snip explanation]

Ta for the explanation.  To summarize, the problem is that when the 
primary's NIC is disconnected, the secondary takes over and DRBD ends 
up in a split-brain state.  Even though the only modifications done on 
the original primary are to de-activate the device, it has still 
changed independently of the current primary.  DRBD (correctly) notices 
the discrepancy, and bails out to avoid nasty conflicts.
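
To check I've understood the logic, here's a rough Python sketch of it.
This is only my mental model, not DRBD's code: the NodeState class, the
changed_while_disconnected flag and classify_reconnect() are names I've
made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    name: str
    # Hypothetical flag: did this node's copy change in any way
    # (including de-activating the device) while the link was down?
    changed_while_disconnected: bool


def classify_reconnect(a: NodeState, b: NodeState) -> str:
    """Decide what can safely happen when two nodes see each other again."""
    if a.changed_while_disconnected and b.changed_while_disconnected:
        # Both copies diverged independently: no safe automatic choice.
        return "split-brain"
    if a.changed_while_disconnected:
        return f"sync {a.name} -> {b.name}"
    if b.changed_while_disconnected:
        return f"sync {b.name} -> {a.name}"
    return "in-sync"


# The case from this thread: the old primary only de-activated the
# device, but that still counts as a change on its side.
old_primary = NodeState("old-primary", changed_while_disconnected=True)
new_primary = NodeState("new-primary", changed_while_disconnected=True)
print(classify_reconnect(old_primary, new_primary))
```

The point being that "how much" changed doesn't come into it at this
stage; any change on both sides is enough to refuse to connect.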

Having thought it through, I still think that from a layman's point
of view this is a functional bug -- ideally drbd should recognize
that it's in this state, and work around it ;-).  But now I
appreciate the difficulties.  STONITH is an option, yes, but one I'd
prefer not to use if I can avoid it.

>> (a) Assume out of sync after network failure ...
> nope. won't work. but see mentioned discussion thread...

A shame; I'll try & poke around the drbd-dev & linux-ha-dev archives, 
ta for the pointer.

> 	we are going to provide a config mechanism somewhen, where
> 	one can configure that the node with less modification will
> 	be chosen, or the current primary will be chosen, or that
> 	... there are many possible ways.

Hmm... I'm quite interested in these options...  it's true that the node
with fewer modifications will typically need to be the one that gets
sync'd.  Might be an idea to let them be rules (i.e.: if current primary
AND has more modifications ...).
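
Roughly what I mean by rules, as a Python sketch.  Everything here is
hypothetical (the Node fields, the rule functions and
choose_sync_source() are my own names, not an existing or planned drbd
config mechanism): rules are tried in order, and if none matches it's
left to a human.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    is_primary: bool      # is this node the current primary?
    modified_blocks: int  # hypothetical count of changes since disconnect


# A rule returns the node to use as sync source, or None if it has no opinion.
Rule = Callable[[Node, Node], Optional[Node]]


def primary_with_more_modifications(a: Node, b: Node) -> Optional[Node]:
    """'if current primary AND has more modifications' -> sync source."""
    for node, peer in ((a, b), (b, a)):
        if (node.is_primary and not peer.is_primary
                and node.modified_blocks > peer.modified_blocks):
            return node
    return None


def more_modifications(a: Node, b: Node) -> Optional[Node]:
    """The node with fewer modifications is the one that gets sync'd."""
    if a.modified_blocks == b.modified_blocks:
        return None
    return a if a.modified_blocks > b.modified_blocks else b


def choose_sync_source(a: Node, b: Node, rules: List[Rule]) -> Optional[Node]:
    """Return the first rule's pick, or None to leave it to the admin."""
    for rule in rules:
        winner = rule(a, b)
        if winner is not None:
            return winner
    return None


RULES: List[Rule] = [primary_with_more_modifications, more_modifications]

old_primary = Node("old-primary", is_primary=False, modified_blocks=3)
new_primary = Node("new-primary", is_primary=True, modified_blocks=120)
source = choose_sync_source(old_primary, new_primary, RULES)
print(source.name if source else "needs a human")
```

Returning None when no rule applies keeps the current behaviour (bail
out) as the default, which seems the safe way round.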

Thinking out loud...

What about a preventative option:
	a. become secondary & discard modifications on connection loss
	(ok, so that's crap for primaries - forget it)

On that train of thought, could a command to discard all changes made
since the node was last connected to its peer be handy?  Something like
'invalidate', but one that doesn't force a complete resync.
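
A toy Python sketch of what I have in mind, assuming each node keeps a
set of block numbers written since the link dropped.  I don't know
whether drbd tracks things this way; discard_local_changes() and both
"dirty" sets are invented for illustration.

```python
from typing import List, Set


def discard_local_changes(local: List[bytes], peer: List[bytes],
                          local_dirty: Set[int], peer_dirty: Set[int]) -> int:
    """Make `local` match `peer` by copying only blocks either side touched.

    Returns the number of blocks copied, to compare with a full resync.
    """
    # Blocks we changed must be thrown away; blocks the peer changed
    # must be fetched.  Everything else is already identical.
    to_copy = local_dirty | peer_dirty
    for block_no in sorted(to_copy):
        local[block_no] = peer[block_no]
    local_dirty.clear()
    peer_dirty.clear()
    return len(to_copy)


local = [b"a", b"X", b"c", b"d"]   # block 1 changed locally
peer = [b"a", b"b", b"c", b"D"]    # block 3 changed on the peer
copied = discard_local_changes(local, peer, {1}, {3})
print(copied, "of", len(local), "blocks copied")
```

So the cost would scale with how much changed during the outage
rather than with the size of the device.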

Sounds like it's more of a high-level problem when I think about it...  
Maybe better solved at the FS or failover layer.  I can see why human 
interaction here is a good thing.

> the interesting point is what do we _do_ now.
> and I think we are not too bad currently.

I agree ;-)
I'm just trying to give feedback as I learn to help improve drbd.

