[DRBD-user] DRBD Failover Not Working after Cold Shutdown of Primary

Rois Cannon rois at cobiz.com
Fri Dec 14 22:50:38 CET 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Fri, 2007-12-14 at 12:51 -0800, Art Age Software wrote:
> Thanks for the reply. Funnily enough, I noticed your message to the
> list just before I joined (by checking the archives). The solution you
> proposed worked for me as well:
> 
> try configuring a shorter timeout for drbd-peer-outdater  like so:
> 	outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
> 
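For reference, this is roughly how that handler sits in my drbd.conf
(DRBD 8.0 syntax; "r0" is just a placeholder resource name, and the
drbd-peer-outdater path can differ between distros):

    resource r0 {
      handlers {
        # shorter (5 second) timeout for dopd, as suggested above
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
      }
    }
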
> However, now I am seeing a different problem that is definitely not
> heartbeat-related. If I shut down the secondary node only (node2)
> while node1's DRBD is primary, then cat /proc/drbd on node1 shows the
> following expected state:
> 
> version: 8.0.6 (api:86/proto:86)
> SVN Revision: 3048 build by buildsvn at c5-x8664-build, 2007-12-08 01:01:10
>  0: cs:WFConnection st:Primary/Unknown ds:UpToDate/Outdated C r---
> 
> Now node1 is running as a "degraded cluster," so I expect that when I
> reboot node1 (leaving node2 shut down), the DRBD startup will respect
> the config setting for degr-wfc-timeout:
> 
> degr-wfc-timeout 120;    # 2 minutes.
> 
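Just to make sure we are comparing the same thing, both of those
timeouts live in the startup section of drbd.conf; mine looks roughly
like this (the values are simply the ones I use):

    startup {
        wfc-timeout      0;    # normal boot: wait for the peer forever
        degr-wfc-timeout 120;  # boot after running degraded: 2 minutes
    }
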
> However, when I reboot node1, it waits forever as if it were using the
> regular wfc-timeout value (which is 0, for an infinite wait).
> 
> According to the docs for degr-wfc-timeout:
> 
>     # Wait for connection timeout if this node was a degraded cluster.
>     # In case a degraded cluster (= cluster with only one node left)
>     # is rebooted, this timeout value is used.
> 
> So, why does DRBD on node1 think that it was not in a degraded state
> when it was rebooted?

Not sure why yours is not coming back up.  Mine quit waiting after the
120 seconds but was, in fact, already set to primary while it was waiting.
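
If it helps, you can check the state from a second console while the
init script is still sitting at its wait-for-connection prompt, e.g.:

    cat /proc/drbd     # on mine the st: field already showed Primary/Unknown
    drbdadm state all  # prints just the roles, e.g. Primary/Unknown

(That is on DRBD 8.0; I have not checked whether other versions behave
the same way.)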

> 
> Thanks again,
> 
> Sam
> 
> > A little more info please (drbd.conf, ha.cf and connection path info).
> >
> > If you are using "fencing resource-only;", try commenting it out and
> > then unplugging the power.  That would at least help you narrow down
> > where the problem is.  That option caused similar problems for me.
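
To be explicit about which knob I meant, it is the fencing option in
the disk section of drbd.conf, roughly like this ("r0" again just a
placeholder name):

    resource r0 {
      disk {
        # resource-only fencing makes DRBD call the outdate-peer
        # handler (dopd) when the peer becomes unreachable
        fencing resource-only;
      }
    }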
> 
> >On Fri, 2007-12-14 at 10:52 -0800, Art Age Software wrote:
> > Hi all,
> >
> > I've been working on setting up a two-node HA cluster with Heartbeat + DRBD.
> > Everything has been going well and now I am in the testing phase.
> > All initial tests looked good. Manual failover via heartbeat commands
> > works as expected.
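
Just so we are talking about the same test: by "manual failover" I
assume you mean the helper scripts that ship with heartbeat, along the
lines of the following (the install path varies by distro; on some it
is /usr/lib/heartbeat instead):

    /usr/share/heartbeat/hb_standby    # run on the active node to give up its resources
    /usr/share/heartbeat/hb_takeover   # run on the passive node to pull them over
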
> > Rebooting the machines works as expected with resources transitioning correctly.
> > Finally, it was time for the biggest test of all...physically pulling
> > power on the primary.
> > Now, this is a painful thing to have to do - but of course necessary.
> > So, naturally I had hoped it would "just work" on the first try.
> > Unfortunately, it did not. DRBD became primary on node2, but the file
> > system never mounted.
> > From looking at the logs, DRBD on node2 seems to have aborted
> > Heartbeat's attempt at takeover. I have copied a relevant portion of
> > the log below.
> > Can anybody offer any insight into what went wrong? Clearly, it isn't
> > a highly available system if node2 won't take over in the event that
> > node1 disappears.
> >
> > DRBD vers
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user



