On Fri, Oct 19, 2007 at 10:12:44AM -0700, malahal at us.ibm.com wrote:
> Lars Ellenberg [lars.ellenberg at linbit.com] wrote:
> > On Thu, Oct 18, 2007 at 09:56:24AM -0700, malahal at us.ibm.com wrote:
> > > I am trying to implement a better 'master selection' scheme in Linux
> > > dm-mirror/LVM2 code.
> > >
> > > I read that DRBD used to have generation number scheme instead of the
> > > current UUID scheme. I read that the old 'generation number' scheme
> > > doesn't work for more than 2 nodes. Can some please explain how/why?
> > > Would a time stamped generation number help for more than 2 nodes?
> > >
> > > I understand that a generation number scheme would NOT work for quick
> > > resynchronization for more than 2 nodes, but would that help if we were
> > > to do full-resynchronization?
> >
> > did you _read_ the papers at http://www.drbd.org/publications.html?
> >
> > and, why would you want to go with drbd 0.7 or older?
> > you probably cannot maintain it,
> > and we won't do that much longer either.
>
> Thank you for your reply. I am not planning on running DRBD at all. I
> just want to borrow ideas from DRBD code and implement them in Linux
> LVM Mirror. Many things are common but few are different between
> DRBD and Linux LVM Mirror. I am trying to understand, if generation
> number itself would be sufficient for Linux LVM Mirror. Yes, I did read
> the publications listed there.

now see how a little context gives a different message :)
sorry for jumping to wrong conclusions...

generation counters (basically event counters) alone are not sufficient.

the "UUID" data generation tags together with some embedded state flags
are much more reliable to detect diverging data sets.
but to be honest, I am unhappy with those also.
in fact I think the most reliable would be to have "qualified" event
counters, so basically "prefix" the generation counters with the
"node id" for drbd, or more generically in the context of LVM mirror,
with the data storage instance id, which should be a UUID.

the problem to solve is:
two instances of data storage start to communicate.
they now have to determine
 * whether they are related at all
 * if they are related (share a common history),
   * is one a direct descendant of the other?
   * or do they have a common history, but have then undergone
     different, independent modifications, leading to diverging datasets?
     * in that case, can this be solved automatically based on some
       configured pre-authorized algorithms to throw away certain
       transactions under special circumstances?

one problem with unqualified "event counters" is,
when equivalent events happen independently on different instances,
comparing the event/generation counters may suggest that the data
is identical, even though it is very much different.

whether or not they could be made to work "correctly" within some strict
constraints for more than two data instances is not worth pursuing:
they already fail miserably with only two data instances when both are
exposed independently to arbitrary sequences of "events".

so there is the answer to why generation/event counters are not sufficient:
you could
 * miss out on a sync, silently making the next switchover/fallback
   to the other mirror instance corrupt your data;
 * miss the fact that you had diverging data sets, and would have needed
   manual intervention to solve the conflict;
 * or even sync in the "wrong" direction, throwing away valid
   transactions, warping your data backwards into the past.

having only UUID-tags may lead to (unlikely) clashes, so to be
"theoretically correct", you'd need to put in some logic to handle those
potential clashes, too. and from just looking at the history sequence
of UUIDs, you cannot tell much.
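
to illustrate the failure described above, here is a minimal sketch in
Python. it is not DRBD or LVM code. the `Tag` layout, the list-based
history, and the `relate` and `unqualified` helpers are all hypothetical,
made up for this example; instance ids would be UUIDs in practice, and
short names are used here only for readability. two instances share one
generation and are then modified independently: the plain counters look
the same, while the qualified tags show the divergence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tag:
    """Qualified data generation tag (hypothetical layout)."""
    instance_id: str  # would be a UUID in practice; short names used here
    counter: int      # event counter local to that instance


def relate(a: List[Tag], b: List[Tag]) -> str:
    """Classify two tag histories (oldest entry first)."""
    # Length of the common prefix of both histories.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    if common == 0:
        return "unrelated"
    if common == len(a) == len(b):
        return "identical"
    if common == len(a):
        return "b descends from a"
    if common == len(b):
        return "a descends from b"
    return "diverged"


def unqualified(history: List[Tag]) -> List[int]:
    """Drop the instance id, leaving plain generation counters."""
    return [t.counter for t in history]


# Both instances share generation (A, 1), then each one is modified
# independently while the other is unreachable.
node_a = [Tag("A", 1), Tag("A", 2)]
node_b = [Tag("A", 1), Tag("B", 2)]

print(unqualified(node_a) == unqualified(node_b))  # True: counters alone look identical
print(relate(node_a, node_b))                      # diverged
```

a real implementation would keep only a bounded history and would need a
policy for the "diverged" case; this sketch only shows the comparison.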
when qualifying event counters with the node/instance uuid,
you have all you need: you have a telling history, you have no clashes.
so that would be the way to go, IMO.

when data instances decide to sync up, they also decide which side is to
be the sync target, and which the sync source. they now should generate
a new "event id" to base their bitmap against, unless you have a bitmap
tracking exactly that pair of data instances, in which case re-tagging
the bitmap would be unnecessary.

once the sync is done, the sync target will tag its data generation with
the data generation tag of the sync source, so they know they are
identical again.

Note: this "data generation tag" _includes_ the node/instance-id,
so the sync target now tags its data with the "foreign" node/instance-id
and event counter.

even looking at just the history of such tags, based on the succession
of node ids, you'd be able to tell what sync events in what direction
have happened, and based on the event counts how many different things
have happened independently.

did this make sense?

-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :