[DRBD-user] n00b question about generation numbers

Sun Oct 21 13:24:17 CEST 2007

On Fri, Oct 19, 2007 at 10:12:44AM -0700, malahal at us.ibm.com wrote:
> Lars Ellenberg [lars.ellenberg at linbit.com] wrote:
> > On Thu, Oct 18, 2007 at 09:56:24AM -0700, malahal at us.ibm.com wrote:
> > > I am trying to implement a better 'master selection' scheme in Linux
> > > dm-mirror/LVM2 code.
> > > 
> > > I read that DRBD used to have generation number scheme instead of the
> > > current UUID scheme. I read that the old 'generation number' scheme
> > > doesn't work for more than 2 nodes. Can someone please explain how/why?
> > > Would a time stamped generation number help for more than 2 nodes?
> > > 
> > > I understand that a generation number scheme would NOT work for quick
> > > resynchronization for more than 2 nodes, but would that help if we were
> > > to do full-resynchronization?
> > 
> > did you _read_ the papers at http://www.drbd.org/publications.html?
> > 
> > and, why would you want to go with drbd 0.7 or older?
> > you probably cannot maintain it,
> > and we won't do that much longer either.
> 
> Thank you for your reply. I am not planning on running DRBD at all. I
> just want to borrow ideas from DRBD code and implement them in Linux
> LVM Mirror. Many things are common but few are different between
> DRBD and Linux LVM Mirror. I am trying to understand, if generation
> number itself would be sufficient for Linux LVM Mirror. Yes, I did read
> the publications listed there.

now see how a little context gives a different message :)

sorry for jumping to wrong conclusions...

generation counters (basically event counters) alone are not sufficient.
the "UUID" data generation tags together with some embedded state flags
are much more reliable at detecting diverging data sets.
but to be honest, I am unhappy with those also.
in fact I think the most reliable would be to have "qualified" event
counters, so basically "prefix" the generation counters with the "node
id" for drbd, or more generically in the context of LVM mirror, with the
data storage instance id, which should be a UUID.

the problem to solve is:
  two instances of data storage start to communicate.
  they now have to determine
    * whether they are related at all
    * if they are related (share a common history),
      * is one a direct descendant of the other?
      * or do they have a common history,
        but have since undergone different, independent modifications,
        leading to diverging datasets?
        * in that case, can this be solved automatically
          based on some configured pre-authorized algorithms to
          throw away certain transactions under special circumstances?
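
the decision tree above can be sketched roughly like this. it is only an
illustration, not what drbd does: it assumes each instance keeps its full
history as an ordered list of opaque generation tags (oldest first), and
the name classify() and the returned strings are made up for this example.
automatic conflict resolution (the last bullet) is not covered.

```python
def classify(a, b):
    """Compare two generation-tag histories (oldest tag first)."""
    # length of the shared leading run of tags
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1

    if common == 0:
        return "unrelated"            # no shared history at all
    if common == len(a) == len(b):
        return "identical"            # same history, nothing to sync
    if common == len(b):
        return "a descends from b"    # b's history is a prefix of a's
    if common == len(a):
        return "b descends from a"    # a's history is a prefix of b's
    return "diverged"                 # shared start, then independent changes


print(classify(["g1", "g2", "g3"], ["g1", "g2"]))
print(classify(["g1", "g2"], ["g1", "x"]))
```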

one problem with unqualified "event counters" is,
when equivalent events happen independently on different instances,
comparing the event/generation counters may suggest
that the data was identical,
even though it is very much different.
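
a toy illustration of that failure, assuming a bare integer counter that
is bumped whenever an instance goes active without its peer. the class
and method names are hypothetical, chosen only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    counter: int = 0
    blocks: dict = field(default_factory=dict)

    def activate_alone(self):
        # hypothetical "event": going active without seeing the peer
        self.counter += 1

    def write(self, block, data):
        self.blocks[block] = data


# both sides start out in sync, at generation 5
a = Instance(counter=5, blocks={0: "old"})
b = Instance(counter=5, blocks={0: "old"})

# the link is lost; both sides go active on their own and accept writes
a.activate_alone()
a.write(0, "written on a")
b.activate_alone()
b.write(0, "written on b")

same_generation = a.counter == b.counter   # 6 == 6
same_data = a.blocks == b.blocks           # the contents differ
print(same_generation, same_data)          # prints: True False
```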

whether they could be made to work "correctly" within some
strict constraints for more than two data instances is not worth pursuing:
they already fail miserably with only two data instances when
both are exposed independently to arbitrary sequences of "events".

so there is the answer to why generation/event counters
are not sufficient:
you could miss out on sync,
silently making the next switchover/fallback to the other
mirror instance corrupt your data;
miss the fact that you had diverging data sets,
and would have needed manual intervention to solve the conflict;
or even sync in the "wrong" direction, throwing away valid transactions,
warping your data backwards into the past.

having only UUID-tags may lead to (unlikely) clashes,
so to be "theoretically correct", you'd need to put in some logic to
handle those potential clashes, too. and from just looking at the
history sequence of UUIDs, you cannot tell much.

when qualifying event counters with node/instance uuid,
you have all you need: you have a telling history,
you have no clashes. so that would be the way to go, IMO.
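
a minimal sketch of such a qualified counter, assuming the tag is simply
the pair (instance uuid, event count) and that a new local event takes
the count of the current tag plus one. Tag, Instance and local_event()
are names invented for this example.

```python
import uuid
from typing import NamedTuple


class Tag(NamedTuple):
    instance: uuid.UUID   # which data instance generated the event
    count: int            # event counter at that point


class Instance:
    def __init__(self):
        self.id = uuid.uuid4()
        self.history = [Tag(self.id, 0)]

    @property
    def current(self):
        return self.history[-1]

    def local_event(self):
        # new generation, qualified with our own id
        self.history.append(Tag(self.id, self.current.count + 1))


a = Instance()
b = Instance()
b.history = list(a.history)   # b starts out as a synced copy of a

# the same kind of event happens independently on both sides
a.local_event()
b.local_event()

counts_equal = a.current.count == b.current.count   # both counters are 1
tags_equal = a.current == b.current                 # but the ids differ
print(counts_equal, tags_equal)
```

the bare counters still collide, but the qualified tags do not, so the
divergence is visible.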

when data instances decide to sync up,
they also decide which side to be sync target,
and which to be sync source.
they should now generate a new "event id"
to base their bitmap against,
unless you have a bitmap tracking exactly that pair of data instances,
in which case re-tagging the bitmap would be unnecessary.

once the sync is done, the sync target will tag its data generation with
the data generation tag of the sync source, so they know they are
identical again.
Note: this "data generation tag" _includes_ the node/instance-id,
so the sync target now tags its data with the "foreign"
node/instance-id and event counter.
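
sketched in code, under the assumption that a tag is the pair
(instance id, event count) and with short strings standing in for real
uuids. finish_sync() and the class layout are hypothetical.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    instance: str   # stand-in for the instance uuid
    count: int


class Instance:
    def __init__(self, name, history, blocks):
        self.name = name
        self.history = list(history)
        self.blocks = dict(blocks)


def finish_sync(source, target):
    # the data has been copied over...
    target.blocks = dict(source.blocks)
    # ...so the target adopts the source's tag unchanged,
    # including the foreign instance id
    target.history.append(source.history[-1])


shared = Tag("A", 3)
a = Instance("A", [shared, Tag("A", 4)], {0: "new"})
b = Instance("B", [shared], {0: "old"})

finish_sync(source=a, target=b)
print(b.history[-1] == a.history[-1])   # both now carry the same tag
```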

even looking at just the history of such tags,
based on the succession of node ids,
you'd be able to tell what sync events
in what direction have happened,
and based on the event counts
how many different things have happened independently.
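
as a rough illustration of reading such a history: the sketch below
assumes tags of the form (instance id, event count), treats the first
entry as the starting point, reads every later foreign tag as "synced
from that instance", and reads every own tag as one independent local
event. describe() and the sample history are made up.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    instance: str   # stand-in for the instance uuid
    count: int


def describe(own_id, history):
    """Return (syncs, local_events) read off one instance's tag history."""
    syncs = []
    local_events = 0
    for prev, cur in zip(history, history[1:]):
        if cur.instance == own_id:
            local_events += 1                          # our own event
        elif cur != prev:
            syncs.append(cur.instance + " -> " + own_id)   # adopted a foreign tag
    return syncs, local_events


history_on_b = [Tag("A", 3), Tag("B", 4), Tag("A", 5), Tag("B", 6), Tag("B", 7)]
syncs, local_events = describe("B", history_on_b)
print(syncs, local_events)
```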

did this make sense?

-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :