[Linux-ha-dev] Re: [DRBD-user] drbd peer outdater: higher level implementation?

Lars Marowsky-Bree lmb at suse.de
Sat Sep 13 04:13:55 CEST 2008

On 2008-09-12T23:55:53, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:

Trying to explain again.

> situation 1:
> 	primary crash.

-> secondary receives "peer is stopped (fenced)" notification, clears
   outdated flag

> 	secondary has to take over,

-> secondary promotes fine

> 	so it better not mark itself outdated.
> situation 2:
> 	replication link breaks

-> Pacemaker doesn't do anything, because it doesn't know ;-)

(Actually, drbd itself can't tell whether the link broke or the
secondary is indeed down)

-> Primary marks itself as "outdated" for now, freezes IO
   (As you don't like me to say that it is outdated, because this seems
   to invoke the current meaning instead of the new behaviour, maybe I
   should call it "marks itself as 'in flux'"? I'm open to using
   clearer terminology.)

> 	primary wants to continue serving data.

-> primary calls out to mark the peer as failed
-> peer (secondary) is stopped by pacemaker, or fenced (if the machine
   hung, crashed, whatever)

> 	so secondary must mark itself outdated.

-> Secondary stays "outdated" by virtue of not having received any of
   the signals that would clear the flag

-> Primary receives "peer is stopped" notification, clears flag, and
   continues serving data

> 	otherwise on a later primary crash heartbeat would try to make
> 	it primary and succeed in going online with stale data.
> 	that DID HAPPEN.
> 	that is why dopd was invented in the first place.

Right, and I don't think it can happen with this scheme.
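
To make the two sequences above concrete, here is a minimal toy model of
the flag handling. It is only a sketch: the class, method, and function
names are made up for illustration and are not drbd or Pacemaker
interfaces, and it assumes that both sides set the flag when they lose
the connection and that only a "peer is stopped (fenced)" notification
clears it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of one drbd node; not real drbd state."""
    name: str
    primary: bool = False
    running: bool = True
    outdated: bool = False    # the "outdated" / "in flux" flag
    io_frozen: bool = False

    def on_connection_lost(self):
        # Assumption: losing the peer always sets the flag,
        # and a primary additionally freezes IO.
        self.outdated = True
        if self.primary:
            self.io_frozen = True

    def on_peer_stopped(self):
        # "peer is stopped (fenced)" notification from the cluster manager.
        # A node that is itself stopped never sees it.
        if not self.running:
            return
        self.outdated = False
        self.io_frozen = False

    def can_promote(self):
        return self.running and not self.outdated


def situation_1_primary_crash():
    a, b = Node("a", primary=True), Node("b")
    a.running = False          # primary crashes
    b.on_connection_lost()     # secondary notices the peer is gone
    b.on_peer_stopped()        # fencing notification clears the flag
    return a, b


def situation_2_link_break():
    a, b = Node("a", primary=True), Node("b")
    a.on_connection_lost()     # primary: flag set, IO frozen
    b.on_connection_lost()     # secondary: flag set
    b.running = False          # primary's call-out: secondary is stopped
    a.on_peer_stopped()        # primary: flag cleared, IO resumes
    return a, b


if __name__ == "__main__":
    _, b = situation_1_primary_crash()
    print("situation 1: secondary may promote:", b.can_promote())
    a, b = situation_2_link_break()
    print("situation 2: primary frozen:", a.io_frozen,
          "| secondary outdated:", b.outdated)
```

In this model the stopped secondary in situation 2 never receives a
clearing notification, so it keeps the flag and refuses promotion.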

> variation:
> 	as it may be a cluster partition.
> 	with stonith, (at least) one of the nodes gets shot.

That is actually identical to either one of the above scenarios, I
think, depending on which side wins.

Only the surviving side will receive all the right steps to continue
serving data.

> 	primary must freeze until peer is confirmed outdated (or shot)

It'd still call out to try and fail the peer; but as that is impossible
(peer is unreachable), it'll instead receive the fencing notification.

> 	and must unfreeze again as soon as peer is confirmed outdated (or shot)

Or the primary is shot; could go either way, but that would look like
scenario 1.


Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
