[Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources

Mon Sep 20 17:09:36 CEST 2004

[ I am not subscribed to linux-ha-dev ]

Hi Lars,

[...]
> If we notice that N1 is crashed first, that's fine. Everything will
> happen just as always, and N2 can proceed as soon as it sees the
> post-fence/stop notification, which it will see before being promoted to
> master or even being asked about it.
>
> But, from the point of view of the replicated resource on N2, this is
> indistinguishable from the split-brain; all it knows is that it lost
> connection to it's peer. So it goes on to report this.
>
> If this event occurs before we have noticed a monitoring failure or full
> node failure on N1 and were using the recovery method explained so far,
> we are going to assume an internal split-brain, and tell N2 to mark
> itself outdated, and then try to tell N1 to resume.  Oops. No more
> talky-talky to N1, and we just told N2 it's supposed to refuse to become
> master.

So the algorithm in HB/CRM seems to be:

If I see that resource (drbd) got disconnected from its peer. then {
 If the resource is a replica (secondary) then {
  tell it that it should mark itself as "desync". 
 } else /* Resource is master (primary) */ {
  Wait for the post fence event and thaw the resource.
 }
}

> So, this requires special logic - whenever one incarnation reports an
> internal split-brain, we actively need to go and verify the status of
> the other incarnations first.
>
> In which case we'd notice that, ah, N1 is down or experiencing a local
> resource failure, and instead of outdating N2, would fence / stop N1 and
> then promote N2.
>
> This is the special logic I don't much like. As Rusty put it in his
> keynote, "Fear of complexity" is good for programmers. And this reeks of
> it - extending the monitor semantics, needing an additional command on
> the secondary, _and_ needing to talk to all incarnations and then
> figuring out what to do. (I don't want to think much about partitions
> with >2 resources involved.) Alas, the problem seems to be real.
>

What is about:

If I see that resource (drbd) got disconnected from its peer. then {
 If the resource is a replica (secondary) then {
  /* do nothing */
 } else /* Resource is master (primary) */ {
  Ask the other node to do the fencing.
 }
}

If I see a fence ack then {
 Thaw the resource.
}

There is no special case in there...

> Here's some other alternatives I've thought about which seem simpler,
> but which I then noticed don't solve the problem completely.
>
>
> A) Rely on the internal split-brain timeout being larger than our
> deadtime of N1 and the resource monitoring interval.
>
> This _seems_ to solve it - because then the problematic ordering does
> not occur, but relies quite a bit on timing. And if the resource on N2
> notices, for example, a connection loss immediately, this basically
> can't be made to work. Oh yeah, it can be worked around by adding delays
> etc, but that smells a bit dung-ish, too.
>

Right, ... 
Currently drbd's timeout needs to be smaller than heartbeat's deadtime,
making this the other way round asks for troubles I think...

[...]

BTW, from the text I realized that hearbeat will monitor the resource (drbd).
Probabely with calling the resource script with a new method. Basically
hearbeat polls DRBD for an change in the connection state.

Would you like to have an active notification from DRBD ? 

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :