[DRBD-user] Why not keep track of peer outdated on up node?

Lars Ellenberg lars.ellenberg at linbit.com
Wed Mar 11 11:31:37 CET 2009

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Tue, Mar 10, 2009 at 09:42:16AM -0700, Martin Fick wrote:
> 
> --- On Tue, 3/10/09, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> > > However, if you keep track of your peer's failure,
> > > this restriction is potentially removed.
> > 
> > We keep track of the peer being "outdated",
> > if it is.
> 
> Cool!
> 
> 
> > > If node
> > > B suffers an HD failure and you are replacing its
> > > drive, do you want your cluster to require manual
> > > boot intervention if node A happens to go down
> > > for a minute?
> > 
> > if it "goes down" _unexpectedly_,
> > it will be a crashed primary,
> > and use the "degr-wfc-timeout".
> > which is finite by default.
> > no manual intervention needed.
> 
> But, is that not a risky suggestion?  Will both node A and B
> in the above scenario start on their own (without being
> able to connect to their peer) after "degr-wfc-timeout"?
> If so, then for node A it would be a safe solution, but not
> for node B since it may already be outdated causing split
> brain, no?!

if it knows it is outdated,
it will refuse to become primary.
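
for example (resource name "r0" and the exact messages are only
illustrative, the wording differs between versions):

    # drbdadm dstate r0
    Outdated/DUnknown
    # drbdadm primary r0
    (refused -- DRBD will not promote a node whose own data is
     marked Outdated, unless you explicitly force it)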

> > if it is _shut down_ cleanly, explicitly,
> > then well, you had manual intervention anyways.
> > 
> > that is currently the reasoning behind how we deal
> > with wfc-timeout, and degr-wfc-timeout.
> 
> OK, but neither of those situations allow a single node
> to start safely automatically on its own currently, do 
> they?

what is "safely".

> > > Seems unnecessary to me, node A
> > > should be smart enough to be able to reboot and
> > > continue on its own?
> > 
> > but yes, we are considering not waiting at all _iff_ we find
> > the "peer is outdated or worse" flag in the meta data.
> > the flag is already there.
> 
> I think that would be a very valuable option, making a
> cluster much more highly available, especially with low-end
> commodity hardware and non-professional setups where it
> might not be uncommon for machines to go down.

if you use crappy hardware,
I deem it a bold assumption that "HA" clustering will make the
overall system more available.

and, as mentioned, I read "machines going down" as "crash".
that is already handled via the "crashed primary" detection
and the resulting use of degr-wfc-timeout.

> Cool, but how can a cluster manager get this info then?
> I tried using drbdadm dstate and could not see a difference
> in this case, am I missing something?

I don't think that the cluster manager should know anything about
the "crashed primary" flag of DRBD.
we already have issues with crm and drbd outsmarting each other.

and what exactly would the cluster manager do with the
information that this was a drbd primary before the crash?

the whole "wfc-timeout" stuff is actually intended for a setup where
drbd is configured (and does this wfc-timeout) _before_ any cluster
manager is started and would consider to promote anyone.

> > > Even if drbd does not use this info, why not store the
> > > fact that you are positive that your peer is outdated
> > 
> > we already do.
> 
> Again, cool, how do I get that info from a script?

dstate

you need to configure "fencing"
to something other than "dont-care".
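
for example, in drbd.conf (resource name and handler path are only
examples; with heartbeat the usual fence-peer helper is dopd):

    resource r0 {
      disk {
        fencing resource-only;
      }
      handlers {
        fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
      }
      # ... rest of the resource (devices, addresses) as usual
    }

once the peer has actually been outdated, dstate shows it as
local/peer disk state:

    # drbdadm dstate r0
    UpToDate/Outdated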

> > > (or that you are in split brain)?!
> > 
> > hm.  we probably could.  but what would we then do with
> > that information?  just display it?
> 
> Yes, for starters, that would make better/smarter 
> integration with heartbeat possible.  This data could 
> become a node attribute that could then become a 
> constraint which allows a node to be promoted to master 
> even if it cannot connect to its peer.

huh?

because we already have data divergence,
it is ok to diverge even further?

I don't follow you there.

> Thanks for your consideration of this.  I run drbd
> for my home servers, nothing that really needs HA, I
> just like the ability to fail over when I want to maintain
> a machine.  Since my data is important to me, but
> HA is a secondary goal, it is not uncommon for me to
> operate for a while with a node down.  This means that
> I am probably much more prone to failure scenarios than
> the average "serious" setup.  That being the case, I am
> more aware of the drawbacks of the current
> failure handling.  This one has burned me before.

in short,
you want DRBD to protect you against operator error,
even after suffering from multiple failures already,
without being able to communicate with its peer.
that is pretty hard to implement ;)

> Another thing that makes my setup more prone to
> encountering this problem is the asymmetry of my cluster.
> I have only one of the nodes on a UPS, which means that if
> the power goes out, my usual secondary will drop out right
> away (node B).  But since my UPS has a limited backup time,
> if the power outage is long, the primary will eventually
> also go down (node A).  Now, when power comes back on, you
> would think that I would be fine, but another "feature"
> of much commodity hardware is soft power switches which
> do not always work right.  Despite a BIOS setting that
> supposedly makes my backup computer (node B) able to power
> on by itself, it will not actually power on without manual
> intervention.  So, after a long enough power outage, node
> A will return on its own unattended while node B will not.
> This leaves me with a cluster that is down but could be
> up.

with current DRBD:
finite (say, 15 min) wfc-timeout.
finite (say   1 min) degr-wfc-timeout.
done.
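
in drbd.conf terms that would be (values are in seconds, the numbers
are just the example values from above):

    common {
      startup {
        wfc-timeout      900;  # 15 min on a clean boot
        degr-wfc-timeout  60;  #  1 min when the node was a degraded/crashed primary
      }
    }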

> I know that I describe a lot of things above that no one
> in their right mind would want to do if they were serious
> about HA.  However, I also believe that those who are
> serious about HA are less likely to actually make their
> clusters fail deliberately in various ways for testing
> only.

why that?  if you are serious about HA, of course you
test your cluster's response to failure scenarios
before going into production.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


