[DRBD-user] DRBD Failover Not Working after Cold Shutdown of Primary

Art Age Software artagesw at gmail.com
Tue Jan 8 19:25:41 CET 2008

OK, I think I understand. Just to be clear:

If I am running a degraded cluster (say the secondary server is being
replaced and will be unavailable for several days), there are two
possibilities when restarting the primary:

1) The primary crashes and reboots. In this case degr-wfc-timeout is honored.


2) The primary is **manually** rebooted (cleanly). In this case
degr-wfc-timeout is **not** honored.

Is this correct?

And if so, what is the intention behind degr-wfc-timeout exactly? Why
would I want to control it separately from wfc-timeout?

I think it would be very handy to have a config setting that says
"This node is a one-node cluster until further notice. So, don't
bother waiting for the peer - don't worry about split-brain - just
start up."
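The closest thing I have found is capping both waits in the startup section of drbd.conf (syntax as I understand it from the drbd.conf man page; the values here are just examples, not recommendations):

```text
startup {
    # Cap the normal boot-time wait for the peer so an unattended
    # reboot cannot hang forever (0 would mean wait indefinitely).
    wfc-timeout      15;

    # Cap the wait after a crash of a degraded node as well.
    degr-wfc-timeout 15;
}
```

That still waits a little on every boot, though, rather than saying "don't wait at all until further notice."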

In the scenario I mentioned (having the secondary out for
maintenance), it would be nice if it were not so easy to get into a
situation where you think the server has come up - but your services
are not all there. I can see this happening in a managed hosting
situation where the hosting service reboots the machine for some
reason but is unaware of the DRBD aspect of things.
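For reference, here is how I now understand the two settings in the startup section of drbd.conf (again, syntax per the man page as I read it; the timeout values are only illustrative):

```text
startup {
    # Used on a normal boot, including a clean reboot of a degraded
    # node: wait this long for the peer (0 = wait forever).
    wfc-timeout      0;

    # Used only when the node crashed while it was a degraded Primary
    # (i.e., the peer was already unreachable before the crash): a
    # shorter wait, so an unattended box can come up on its own.
    degr-wfc-timeout 120;
}
```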



On Jan 8, 2008 2:56 AM, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> On Mon, Jan 07, 2008 at 01:16:16PM -0800, Art Age Software wrote:
> > Hi all,
> >
> > I've asked this question before and have still not figured it out.
> >
> > Either the degr-wfc-timeout setting is not working as documented, or
> > I just don't understand how it is supposed to work.
> >
> > Here's the scenario:
> >
> > 1) Both primary and secondary nodes (servers) are running. DRBD is
> > primary/connected/uptodate on Node1 and secondary/connected/uptodate
> > on Node2.
> >
> > 2) Shut down Node2. This takes DRBD on Node1 into primary/disconnected state.
> >
> > 3) Reboot Node1. (Do **not** start up Node2. It remains shut down.)
> >
> > According to my understanding, what I now have is a "degraded
> > cluster." However, when Node1 reboots, the init script waits forever,
> > ignoring the degr-wfc-timeout setting. It is as if DRBD does not think
> > the cluster is degraded.
> >
> > Another DRBD user on the list has confirmed seeing this behavior as
> > well in his setup.
> >
> > So, is this a DRBD bug? Or am I misunderstanding the use of the
> > degr-wfc-timeout setting?
> If I am currently not Primary,
> but meta data primary indicator is set,
> I just now recover from a hard crash,
> and have been Primary before that crash.
> Now, if I had no connection before that crash
> (have been degraded Primary), chances are that
> I won't find my peer now either.
> In that case, and _only_ in that case,
> we use the degr-wfc-timeout instead of the default,
> so we can automatically recover from a crash of a
> degraded but active "cluster" after a certain timeout.
> Which means that if you _reboot_ a degraded node,
> it will not use the "degr-wfc-timeout".
> The idea is:
> if you intentionally reboot it, you apparently "logged in" anyway
> (well, the reboot will kick you off, but you can immediately log in again).
> Maybe you fixed some hardware thing, and the reboot is supposed to
> pick that up. If not, because you are sitting in front of the console
> anyway, you can confirm/kill that wfc-thing if necessary.
> if it crashed while being Primary, and then later boots up again,
> it will use degr-wfc-timeout.
> --
> : Lars Ellenberg                           http://www.linbit.com :
> : DRBD/HA support and consulting             sales at linbit.com :
> : LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
> : Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
