[DRBD-user] question about start drbd on single node after a power outage

Tue Jan 31 16:47:45 CET 2012

You were both fast ;)
Thanks, all. I understand that there is a high risk of split brain and data loss if I force a single node to come up.
But in my case, service recovery is much more important than avoiding data loss.

I have a 2-node pacemaker cluster, with drbd running as master/slave (controlled by pacemaker).
When power is lost on both nodes at once (just like unplugging the power cable from both nodes at the same time),
drbd seems to have no chance to do anything (such as fencing the peer). So, after I booted only the previous primary node,
I saw pacemaker try to promote drbd to master, but it failed. When I run drbdadm status, it shows:

<drbd-status version="8.3.11" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="vksshared" cs="WFConnection" ro1="Secondary" ro2="Unknown" ds1="Consistent" ds2="DUnknown" />
</resources>
</drbd-status>

I tried setting both wfc timeouts (wfc-timeout and degr-wfc-timeout) to 5 seconds, but that did not work.

As you can see, the drbd service started, but the resource cannot be promoted because ds1="Consistent"; only "UpToDate" will work.
Only when I boot the other node, and the 2 drbd instances have connected, does drbd declare the disk state as "UpToDate".
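
A minimal sketch of the check described above, assuming the XML format shown in the status output and assuming that a resource can only be promoted when its local disk state (ds1) is "UpToDate". The helper names are made up for illustration and are not part of drbd or pacemaker.

```python
import xml.etree.ElementTree as ET

# Status output copied from this message (DRBD 8.3.11).
STATUS_XML = """<drbd-status version="8.3.11" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="vksshared" cs="WFConnection" ro1="Secondary" ro2="Unknown" ds1="Consistent" ds2="DUnknown" />
</resources>
</drbd-status>"""


def local_disk_states(status_xml):
    """Return {resource name: local disk state (ds1)} parsed from the XML."""
    root = ET.fromstring(status_xml)
    return {r.get("name"): r.get("ds1") for r in root.iter("resource")}


def not_promotable(status_xml):
    """List resources whose local disk state is anything other than UpToDate.

    Assumption: only ds1 == "UpToDate" allows promotion, as observed above.
    """
    states = local_disk_states(status_xml)
    return [name for name, ds in states.items() if ds != "UpToDate"]


if __name__ == "__main__":
    print(local_disk_states(STATUS_XML))
    print(not_promotable(STATUS_XML))
```

With the status above, vksshared is reported as not promotable because its local disk is only "Consistent".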

Without booting the other node, I have not found an automatic way, safe or not, to force it to consider itself "UpToDate".
It seems drbd does not remember its previous role (primary or secondary), presumably for safety reasons.

Do you have any ideas or comments on this? I looked through the documentation but could not find any setting that does this, even an unsafe one.
If I upgrade drbd to the latest version, will it help?

-----Original Message-----
From: Kaloyan Kovachev [mailto:kkovachev at varna.net] 
Sent: January-31-12 9:35 AM
To: Digimer
Cc: Xing, Steven; drbd-user at lists.linbit.com
Subject: Re: [DRBD-user] question about start drbd on single node after a power outage

You were faster than me :)

On Tue, 31 Jan 2012 09:04:49 -0500, Digimer <linux at alteeve.com> wrote:
> 
> If you want to force the issue though, you can use 'wfc-timeout 300'
> which will tell DRBD to wait up to 5 minutes for its peer. After that
> time, it will consider itself primary. Please don't use this though until
> you've exhausted all other ways of starting safely.

There are two (well documented) options in drbd.conf: wfc-timeout and degr-wfc-timeout. To avoid split-brain I set both to 0, which means waiting indefinitely for the peer.

If you need to skip waiting, you can do so manually from the console, provided you start drbd standalone or before cman / pacemaker.

In my case the storage is exported via iSCSI (not as a cluster resource), so I have an additional wait loop that waits for both nodes to become UpToDate for _all_ configured resources before exporting any of them. 'No data' is better than 'broken data'. Yes, I have been bitten by the latter (luckily during the preproduction phase), and believe me, you don't want that on production nodes (unless you have static read-only data).
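
A rough sketch of such a wait loop, not the actual script. It assumes the status arrives as XML in the format shown earlier in the thread, with ds1 and ds2 holding the local and peer disk states. `fetch_status` is a hypothetical callable that returns that XML text (for example, a wrapper around the `drbdadm status` call mentioned above); the function names and the polling parameters are made up for illustration.

```python
import time
import xml.etree.ElementTree as ET


def all_up_to_date(status_xml):
    """True only if every resource reports UpToDate on both nodes."""
    root = ET.fromstring(status_xml)
    resources = list(root.iter("resource"))
    # An empty resource list is treated as "not ready" rather than "ready".
    return bool(resources) and all(
        r.get("ds1") == "UpToDate" and r.get("ds2") == "UpToDate"
        for r in resources
    )


def wait_until_all_up_to_date(fetch_status, poll_seconds=5.0,
                              max_polls=None, sleep=time.sleep):
    """Poll until all resources are UpToDate on both nodes.

    fetch_status: hypothetical callable returning the status XML as text.
    max_polls:    None waits forever ('no data' over 'broken data');
                  otherwise give up and return False after that many polls.
    """
    polls = 0
    while True:
        if all_up_to_date(fetch_status()):
            return True
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        sleep(poll_seconds)
```

The export step would run only after `wait_until_all_up_to_date(...)` returns True. Leaving `max_polls` at None matches the "wait forever rather than export stale data" choice described above.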