Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Thanks for the reply. Funny enough, I just noticed your message to the
list from just before I joined (by checking the archives). The solution
proposed there worked for me as well: configure a shorter timeout for
drbd-peer-outdater, like so:

    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";

However, now I am seeing a different problem that is definitely not
heartbeat-related. If I shut down only the secondary node (node2) while
node1's DRBD is primary, then cat /proc/drbd on node1 shows the
following expected state:

    version: 8.0.6 (api:86/proto:86)
    SVN Revision: 3048 build by buildsvn@c5-x8664-build, 2007-12-08 01:01:10
     0: cs:WFConnection st:Primary/Unknown ds:UpToDate/Outdated C r---

Now node1 is running as a "degraded cluster," so I expect that when I
reboot node1 (leaving node2 shut down), DRBD's startup will respect the
config setting for degr-wfc-timeout:

    degr-wfc-timeout 120;    # 2 minutes.

However, when I reboot node1 it waits forever, as if it were using the
regular wfc-timeout value (which is 0, for infinite wait). According to
the docs for degr-wfc-timeout:

    # Wait for connection timeout if this node was a degraded cluster.
    # In case a degraded cluster (= cluster with only one node left)
    # is rebooted, this timeout value is used.

So, why does DRBD on node1 think that it was not in a degraded state
when it was rebooted? (The relevant drbd.conf sections are sketched at
the bottom of this message.)

Thanks again,

Sam

> Little more info, please. (drbd.conf, ha.cf and connection path info)
>
> If you are using "fencing resource-only;" try commenting it out and
> then unplugging the power. That would at least help you narrow down
> where the problem is. This one caused similar problems for me.
>
> On Fri, 2007-12-14 at 10:52 -0800, Art Age Software wrote:
> > Hi all,
> >
> > I've been working on setting up a two-node HA cluster with
> > Heartbeat + DRBD. Everything has been going well and now I am in
> > the testing phase. All initial tests looked good. Manual failover
> > via heartbeat commands works as expected. Rebooting the machines
> > works as expected, with resources transitioning correctly.
> >
> > Finally, it was time for the biggest test of all... physically
> > pulling power on the primary. Now, this is a painful thing to have
> > to do - but of course necessary. So, naturally I had hoped it would
> > "just work" on the first try.
> >
> > Unfortunately, it did not. DRBD became primary on node2, but the
> > file system never mounted. From looking at the logs, DRBD on node2
> > seems to have aborted Heartbeat's attempt at takeover. I have
> > copied a relevant portion of the log below.
> >
> > Can anybody offer any insight into what went wrong? Clearly, it
> > isn't a highly available system if node2 won't take over in the
> > event that node1 disappears.
> >
> > DRBD vers
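
P.S. You asked for my drbd.conf; rather than paste the whole thing,
here is a minimal sketch of the sections relevant to this problem
(DRBD 8.0 syntax; the timeout, fencing, and handler values are the ones
from this thread, while the resource name, addresses, and device/disk
paths below are placeholders, not my actual values):

    common {
      startup {
        wfc-timeout       0;    # normal boot: wait forever for the peer
        degr-wfc-timeout  120;  # boot after degraded state: 2 minutes
      }
      disk {
        fencing resource-only;
      }
      handlers {
        # shorter timeout so drbd-peer-outdater cannot stall a takeover
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
      }
    }

    resource r0 {                 # placeholder resource name
      protocol C;
      on node1 {
        device    /dev/drbd0;     # placeholder paths and addresses
        disk      /dev/sda1;
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on node2 {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }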
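
P.P.S. For anyone reproducing this, a quick way to check the states
without eyeballing /proc/drbd (drbdadm subcommands as they exist in
8.0; r0 is the placeholder resource name from the sketch above):

    drbdadm state r0     # node roles, e.g. Primary/Unknown
    drbdadm cstate r0    # connection state, e.g. WFConnection

On node1, before the reboot, these report Primary/Unknown and
WFConnection, which is what makes me think the "degraded" condition
is being met.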