Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Thanks for the reply. Funny enough, I just noticed your message to the
list from just before I joined (by checking the archives). The solution
proposed there worked for me as well: configure a shorter timeout for
drbd-peer-outdater, like so:

    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";

However, now I am seeing a different problem that is definitely not
heartbeat-related. If I shut down only the secondary node (node2) while
node1's DRBD is primary, then cat /proc/drbd on node1 shows the
following expected state:

    version: 8.0.6 (api:86/proto:86)
    SVN Revision: 3048 build by buildsvn@c5-x8664-build, 2007-12-08 01:01:10
     0: cs:WFConnection st:Primary/Unknown ds:UpToDate/Outdated C r---

Now node1 is running as a "degraded cluster," so I expect that when I
reboot node1 (leaving node2 shut down), DRBD's startup will respect the
config setting for degr-wfc-timeout:

    degr-wfc-timeout 120;    # 2 minutes.

However, when I reboot node1 it waits forever, as if it were using the
regular wfc-timeout value (which is 0, for infinite wait). According to
the docs for degr-wfc-timeout:

    # Wait for connection timeout if this node was a degraded cluster.
    # In case a degraded cluster (= cluster with only one node left)
    # is rebooted, this timeout value is used.

So, why does DRBD on node1 think that it was not in a degraded state
when it was rebooted? (The relevant drbd.conf sections are sketched at
the bottom of this message.)

Thanks again,

Sam

> Little more info, please. (drbd.conf, ha.cf and connection path info)
>
> If you are using "fencing resource-only;" try commenting it out and
> then unplugging the power. That would at least help you narrow down
> where the problem is. This one caused similar problems for me.
>
> On Fri, 2007-12-14 at 10:52 -0800, Art Age Software wrote:
> > Hi all,
> >
> > I've been working on setting up a two-node HA cluster with
> > Heartbeat + DRBD. Everything has been going well and now I am in
> > the testing phase. All initial tests looked good. Manual failover
> > via heartbeat commands works as expected. Rebooting the machines
> > works as expected, with resources transitioning correctly.
> >
> > Finally, it was time for the biggest test of all... physically
> > pulling power on the primary. Now, this is a painful thing to have
> > to do - but of course necessary. So, naturally I had hoped it would
> > "just work" on the first try.
> >
> > Unfortunately, it did not. DRBD became primary on node2, but the
> > file system never mounted. From looking at the logs, DRBD on node2
> > seems to have aborted Heartbeat's attempt at takeover. I have
> > copied a relevant portion of the log below.
> >
> > Can anybody offer any insight into what went wrong? Clearly, it
> > isn't a highly available system if node2 won't take over in the
> > event that node1 disappears.
> >
> > DRBD vers
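
P.S. You asked for my drbd.conf; rather than paste the whole thing,
here is a minimal sketch of the sections relevant to this problem
(DRBD 8.0 syntax; the timeout, fencing, and handler values are the ones
from this thread, while the resource name, addresses, and device/disk
paths below are placeholders, not my actual values):

    common {
      startup {
        wfc-timeout       0;    # normal boot: wait forever for the peer
        degr-wfc-timeout  120;  # boot after degraded state: 2 minutes
      }
      disk {
        fencing resource-only;
      }
      handlers {
        # shorter timeout so drbd-peer-outdater cannot stall a takeover
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
      }
    }

    resource r0 {                 # placeholder resource name
      protocol C;
      on node1 {
        device    /dev/drbd0;     # placeholder paths and addresses
        disk      /dev/sda1;
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on node2 {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }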
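
P.P.S. For anyone reproducing this, a quick way to check the states
without eyeballing /proc/drbd (drbdadm subcommands as they exist in
8.0; r0 is the placeholder resource name from the sketch above):

    drbdadm state r0     # node roles, e.g. Primary/Unknown
    drbdadm cstate r0    # connection state, e.g. WFConnection

On node1, before the reboot, these report Primary/Unknown and
WFConnection, which is what makes me think the "degraded" condition
is being met.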