[DRBD-user] Why not keep track of peer outdated on up node?

Martin Fick mogulguy at yahoo.com
Tue Mar 10 17:42:16 CET 2009

--- On Tue, 3/10/09, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> > However, if you keep track of your peer's failure,
> > this restriction is potentially removed.
> 
> We keep track of the peer being "outdated",
> if it is.

Cool!


> > If node
> > B suffers an HD failure and you are replacing its
> > drive, do you want your cluster to require manual
> > boot intervention if node A happens to go down
> > for a minute?
> 
> if it "goes down" _unexpectedly_,
> it will be a crashed primary,
> and use the "degr-wfc-timeout".
> which is finite by default.
> no manual intervention needed.

But is that not a risky suggestion?  Will both node A and node B
in the above scenario start on their own (without being able
to connect to their peer) once "degr-wfc-timeout" expires?
If so, that would be safe for node A, but not for node B,
since node B may already be outdated, and bringing it up alone
could cause a split brain, no?!
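
For reference, here is roughly how the two startup timeouts being
discussed would look in drbd.conf (just a sketch; the resource name
"r0" and the concrete values are made up for illustration):

  resource r0 {
    startup {
      # how long to wait for the peer at a normal boot
      # (0 means wait forever)
      wfc-timeout      120;
      # shorter wait used when the node went down while it was
      # already part of a degraded (single-node) cluster
      degr-wfc-timeout 60;
    }
    # ... disk, net, etc. omitted ...
  }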


> if it is _shut down_ cleanly, explicitly,
> then well, you had manual intervention anyways.
> 
> that is currently the reasoning behind how we deal
> with wfc-timeout, and degr-wfc-timeout.

OK, but neither of those situations currently allows a single
node to start safely and automatically on its own, does it?


> > Seems unnecessary to me, node A
> > should be smart enough to be able to reboot and
> > continue on its own?
> 
> but yes, we do consider to not wait at all _iff_ we find
> the "peer is outdated or worse" flag in the meta data. 
> the flag is already there.

I think that would be a very valuable option; it would make a
cluster much more highly available, especially with low-end
commodity hardware and non-professional setups where it might
not be uncommon for machines to go down.

 
> > Well, it certainly can be handled on the cluster level
> > (and I plan on doing so), but why would drbd not want
> > to store extra important information if possible?
> 
> it already does.

Cool, but how can a cluster manager get this info then?
I tried using drbdadm dstate and could not see a difference
in this case; am I missing something?
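
Concretely, this is roughly what I was looking at (a sketch;
"r0" is just a placeholder resource name):

  # local/peer disk state -- the peer side shows up as
  # DUnknown here in my case:
  $ drbdadm dstate r0
  UpToDate/DUnknown

  # connection state, for comparison:
  $ drbdadm cstate r0
  WFConnection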


> it just does not (yet) use it to skip the wait-for-connection 
> completely. this can probably be changed. this has some more 
> implications though, which we are discussing.

I am not surprised, but I could not think of any.  I am
curious about what you think they are; could you elaborate?


> > Even if drbd does not use this info, why not store the
> > fact that you are positive that your peer is outdated
> 
> we already do.

Again, cool.  But how do I get that info from a script?
 

> > (or that you are in split brain)?!
> 
> hm.  we probably could.  but what would we then do with
> that information?  just display it?

Yes, for starters; that would make better/smarter integration
with heartbeat possible.  This data could become a node
attribute, which could in turn drive a constraint that allows
a node to be promoted to master even when it cannot connect
to its peer.
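
As a rough sketch of what I have in mind (the resource name and
the particular states checked are made up; feeding the result
into heartbeat as a node attribute is left to whatever attribute
tool the cluster manager provides):

  #!/bin/sh
  # Sketch: derive the peer's disk state so a cluster manager
  # could publish it as a node attribute and use it in a
  # promotion constraint.
  RES=r0                                      # placeholder resource name
  PEER=$(drbdadm dstate "$RES" | cut -d/ -f2) # e.g. Outdated, DUnknown

  case "$PEER" in
      Outdated|Inconsistent|Failed|Diskless)
          # peer is known to be outdated or worse:
          # promoting this node alone should be safe
          exit 0 ;;
      *)
          # DUnknown etc.: peer state unknown,
          # promoting alone would risk split brain
          exit 1 ;;
  esac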

Thanks for your consideration on this.  I run drbd
for my home servers, nothing that really needs HA; I
just like the ability to fail over when I want to maintain
a machine.  Since my data is important to me but
HA is a secondary goal, it is not uncommon for me to
operate for a while with a node down.  This means that
I am probably much more prone to failure scenarios than
the average "serious" setup.  That being the case, I am
more aware of the drawbacks of the current
failure handling.  This one has burned me before.

Another thing that makes my setup more prone to
encountering this problem is the asymmetry of my cluster.
I have only one of the nodes on a UPS, which means that if
power goes out, my usual secondary (node B) drops out right
away.  But since my UPS has a limited backup time,
if the power outage is long, the primary (node A) will
eventually also go down.  Now, when power comes back on, you
would think that I would be fine, but another "feature"
of much commodity hardware is soft power switches which
do not always work right.  Despite a BIOS setting that
supposedly lets my backup computer (node B) power
on by itself, it will not actually power on without manual
intervention.  So, after a long enough power outage, node
A will return on its own unattended while node B will not.
This leaves me with a cluster that is down but could be
up.

I know that I describe a lot of things above that no one
in their right mind would want to do if they were serious
about HA.  However, I also believe that those who are
serious about HA are less likely to actually make their
clusters fail deliberately in various ways just for testing.
This means that they too might have hidden
scenarios which could cause more downtime than
they anticipate.  I hope that my (and others') soft HA
attempts will expose more corner cases that drbd could
eventually handle better, becoming more robust than
other HA solutions!

Thanks for listening to my blabbing...

-Martin
