[DRBD-user] Multiple Clusters

Lars Ellenberg Lars.Ellenberg at linbit.com
Fri Jun 4 15:22:29 CEST 2004

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


/ 2004-06-04 11:39:32 +0200
\ Dominique Chabord:
> >> So, instead of having a Primary array I would have 5,6 or 7 "Primary"
> >> arrays
> >> that mirror each other across an isolated network (10.0.0.0 say).
> >
> >
> > Not each to each.
> >
> > You can build a "chain":
> >
> > Situation 1:
> >
> >          Node1      Node2      Node3
> >          -----      -----      -----
> > drbd1    Primary    Secondary  stopped
> >
> > Then switch to Situation 2:
> >
> >          Node1      Node2      Node3
> >          -----      -----      -----
> > drbd1    stopped    Primary    Secondary
> 
> I'll keep this message, you explain very clearly what the situation is.
> >
> > This sort of switchover will take time, because Node3 needs to sync with
> > Node2. In case of a full sync of e.g. 100 GB this will need _hours_.
> 
> Good point.
> So, what would be your recommendation here ?
> 
> I see three possible cases if Node1 fails:
> - CASE1: I think I can repair Node1 and resync it with only the
> updates made after the repair. Therefore I decide not to synchronize
> Node3 as a secondary.
> - CASE2: I think I cannot repair Node1, or it will require a full
> sync anyway. Therefore I decide to synchronize Node3 as a secondary.

Node1 was Primary. It crashed.
-=> regardless of what we do, we always need a full sync.

> - CASE3: I don't know why Node1 is down. I start the sync to Node3 as
> soon as possible, then I see what I can get from Node1. If I repair
> it shortly after, I'll stop syncing Node3 and resync Node1. I will
> have made noise for nothing during this period of time.

not useful. as mentioned, we need a full sync anyway,
and a restarted full sync just starts again from the very beginning.

so if a device in Primary state fails, you should just switch its
Secondary to Primary, then pick a "free" Idle to be the new Secondary,
and do a full sync.
when the crashed node comes back, it should replace the former Idle,
waiting for the next bad thing to happen.

and, of course (we talked about that earlier): since the meta data of
drbd is strictly peer to peer, you most likely have to wipe it out
whenever a device goes into Idle, since the next thing it will do is
become a new Secondary and need a full sync anyway.
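
in python-ish pseudo code, a toy model of that policy (everything here
is made up for illustration, nothing talks to real drbd; the "actions"
are just printed):

class Node:
    def __init__(self, name, role):
        self.name, self.role = name, role  # Primary/Secondary/Idle/crashed

def act(msg):
    print(msg)  # stand-in for whatever the cluster manager really does

def on_primary_failure(nodes, failed):
    failed.role = "crashed"
    survivor = next(n for n in nodes if n.role == "Secondary")
    survivor.role = "Primary"
    act("promote " + survivor.name)             # Secondary -> Primary
    new_sec = next(n for n in nodes if n.role == "Idle")
    new_sec.role = "Secondary"
    act("full sync %s -> %s" % (survivor.name, new_sec.name))

def on_crashed_node_return(node):
    act("wipe meta data on " + node.name)       # full sync needed anyway
    node.role = "Idle"                          # wait for next bad thing

nodes = [Node("Node1", "Primary"), Node("Node2", "Secondary"),
         Node("Node3", "Idle")]
on_primary_failure(nodes, nodes[0])
on_crashed_node_return(nodes[0])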

I) role transitions are then strictly:
 a) Primary -> crashed
 b) crashed Pri -> Idle (wipe out meta data)
 c) Idle -> Secondary, resynchronize the full device
 d) Secondary -> Primary (when the Primary fails, and I am fully synced)
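
or as a table, reusing the Node class from the sketch above (anything
not listed should be treated as a bug in the cluster manager):

LEGAL = {
    ("Primary",   "crashed"):   "a",
    ("crashed",   "Idle"):      "b (wipe meta data first)",
    ("Idle",      "Secondary"): "c (full resync of the device)",
    ("Secondary", "Primary"):   "d (only when fully synced)",
}

def transition(node, new_role):
    step = LEGAL.get((node.role, new_role))
    if step is None:
        raise RuntimeError("illegal: %s -> %s" % (node.role, new_role))
    node.role = new_role
    return step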

to detect (b): on boot, right when I would configure drbd, I check its
meta data. if the Primary indicator is set in the meta data, then I
know the node crashed (if it had been shut down cleanly, it would have
been switched to Secondary first).
so if on boot the Primary indicator of a device is set, I go into
'Idle': first wipe out the meta data, then do *not* configure the
device, and wait for instruction by (wdx ? whomever...) to become a new
Secondary (c).
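
the boot time check as a sketch (read_meta_data() and friends are made
up, drbd's real on-disk meta data format is not modelled here; only
the control flow matters):

def on_boot(device):
    md = read_meta_data(device)      # hypothetical helper
    if md.primary_indicator:
        # died while Primary: a clean shutdown would have switched
        # to Secondary first and cleared the indicator
        wipe_meta_data(device)       # hypothetical helper
        stay_idle(device)            # do *not* configure the device;
                                     # wait to be made a new Secondary (c)
    else:
        configure_device(device)     # normal startup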

there are corner cases where this won't be exactly the best thing to
do, but it is a safe thing to do.
and of course, the cluster manager needs to transactionally track, in
persistent data, which nodes are/have been peers.

clean cluster shutdown and reboot is another interesting case that
needs to be thought through to avoid unnecessary full syncs.

II) the other set of transitions is (and here we actually have a choice):
 a) Secondary -> crashed.
 b.1) crashed Sec -> Secondary, quick sync
		(only if previous peer is current peer and still is
		Primary and still has the same "generation count")
 b.2) crashed Sec -> Idle
		(because some other Idle did already replace me)
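
the b.1/b.2 decision in the same pseudo code style (current_primary(),
previous_peer and generation_count are placeholders for what the
cluster manager and drbd's meta data would really provide):

def on_secondary_return(node, cluster):
    peer = cluster.current_primary()
    if (peer is node.previous_peer
            and peer.role == "Primary"
            and peer.generation_count == node.generation_count):
        resync(node, peer, mode="quick")   # b.1: quick sync is enough
    else:
        wipe_meta_data(node)               # b.2: someone replaced me
        node.role = "Idle"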


> Shaman-X's implementation today is CASE3: start the sync
> automatically and change the secondary manually.

hopefully without forcefully disrupting a running full sync.
there is no point in doing so.

> Maybe we should think about a human decision to implement CASE1.

as mentioned, won't work.

> We might also give a chance to auto-repair Node1, something like:
> if Node1 is not back within 10 minutes, or if Node1 is unstable and
> has failed three times in the last 24h, etc., then we go automated
> and we sync a new secondary.

only useful if the failed node was a Secondary.

BTW, with the new meta data and persistent bitmap in drbd 0.7, it will
be even more important to wipe out the meta data when a crashed former
Primary boots and its former peer is already working as Primary with a
new Secondary peer.

because even though in the intended peer to peer setup the activity
log and persistent bitmap allow for a "fast full sync" (only possibly
changed blocks), in the scenario where we relocate peers over several
nodes we need a "real full sync" (whole device).

so with drbd 0.7, you really ought to have some appropriate timeout
before deciding to sync to a new secondary, leaving the former
secondary, now the active Primary, waiting for the resurrection of its
former peer: when you recruit a new secondary you need the "real full
sync", whereas when the crashed former primary comes back, a "fast
full sync" is enough.
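
as a last sketch, that timeout policy (WAIT_FOR_OLD_PEER and all the
helpers are again made up; note that only the choice of the new peer
waits, the takeover itself happens immediately):

WAIT_FOR_OLD_PEER = 600   # seconds; tune to sync bandwidth / device size

def on_primary_failure_07(cluster, failed):
    survivor = cluster.peer_of(failed)
    survivor.role = "Primary"                  # takeover right away
    if cluster.node_returns(failed, within=WAIT_FOR_OLD_PEER):
        # activity log + bitmap: "fast full sync", changed blocks only
        resync(failed, survivor, mode="fast")
    else:
        new_sec = cluster.pick_idle()
        # a new peer never saw this data: "real full sync", whole device
        resync(new_sec, survivor, mode="full")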


	Lars Ellenberg


