On Sep 21, 2004, at 12:31, Lars Ellenberg wrote:
> / 2004-09-21 08:48:51 +0100
> \ Steve Purkis:
>> Hi all,
>>
>> It seems DRBD 0.7.4 cannot recover from a network failure.
>
> nonsense.
> see below.

[snip explanation]

Ta for the explanation.  To summarize, the problem is that when the
primary's NIC is disconnected, the secondary takes over and DRBD ends
up in a split-brain state.  Even though the only modifications done on
the original primary are to de-activate the device, it has still
changed independently of the current primary.  DRBD (correctly) notices
the discrepancy, and bails out to avoid nasty conflicts.

After thinking through things, I still think that from a layman's point
of view this is a functional bug -- ideally drbd should recognize the
fact that it's in this state, and work around it ;-).  But now I
appreciate the difficulties.  Stonith is an option, yes, but one I'd
prefer not to use if I can avoid it.

>> (a) Assume out of sync after network failure ...
>
> nope. won't work. but see mentioned discussion thread...

A shame; I'll try & poke around the drbd-dev & linux-ha-dev archives,
ta for the pointer.

> we are going to provide a config mechanism somewhen, where
> one can configure that the node with less modification will
> be chosen, or the current primary will be chosen, or that
> ... there are many possible ways.

Hmm... I'm quite interested in these options... it's true that a node
with less modifications will typically need to be the one that gets
sync'd.  Might be an idea to let them be rules (ie: if current primary
AND has more modifications ...).

Thinking out loud...  What about a preventative option:

  a. become secondary & discard modifications on connection loss
     (ok, so that's crap for primaries - forget it)

On that train of thought, a command to discard all changes since last
connected to peer could be handy?  Something like the 'invalidate', but
one that doesn't force a complete resync.
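
To make the "let them be rules" idea a bit more concrete, here is a
rough sketch in Python.  It is only an illustration of the shape such
rules might take, not DRBD's behaviour or config syntax: the NodeState
fields, the modified_blocks counter and every function name below are
made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class NodeState:
    """Hypothetical view of one node after a split brain."""
    name: str
    is_primary: bool
    modified_blocks: int  # assumed: blocks changed since last connected to peer


# A rule looks at both nodes and either names the sync source or abstains.
Rule = Callable[[NodeState, NodeState], Optional[NodeState]]


def prefer_current_primary(a: NodeState, b: NodeState) -> Optional[NodeState]:
    """Choose the primary, but only if exactly one node is primary."""
    if a.is_primary != b.is_primary:
        return a if a.is_primary else b
    return None


def prefer_more_modifications(a: NodeState, b: NodeState) -> Optional[NodeState]:
    """Choose the node with more changes; the other one gets sync'd."""
    if a.modified_blocks != b.modified_blocks:
        return a if a.modified_blocks > b.modified_blocks else b
    return None


def primary_with_more_modifications(a: NodeState, b: NodeState) -> Optional[NodeState]:
    """Combined rule: 'if current primary AND has more modifications'."""
    for node, peer in ((a, b), (b, a)):
        if (node.is_primary and not peer.is_primary
                and node.modified_blocks > peer.modified_blocks):
            return node
    return None


def choose_sync_source(a: NodeState, b: NodeState,
                       rules: List[Rule]) -> Optional[NodeState]:
    """Try the rules in order; the first one that decides wins.

    Returning None means no rule applied, i.e. leave it to a human.
    """
    for rule in rules:
        winner = rule(a, b)
        if winner is not None:
            return winner
    return None
```

With something like this, the original primary that only de-activated
its device would carry a small modified_blocks count and lose to the
node that took over, while a case no rule covers falls through to None
and is left for the admin to sort out by hand.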

Sounds like it's more of a high-level problem when I think about it...
Maybe better solved at the FS or failover layer.  I can see why human
interaction here is a good thing.

> the interesting point is what do we _do_ now.
> and I think we are not too bad currently.

I agree ;-)  I'm just trying to give feedback as I learn to help
improve drbd.

Cheers,
-Steve