[DRBD-user] question about start drbd on single node after a power outage

Digimer linux at alteeve.com
Tue Jan 31 15:04:49 CET 2012



On 01/31/2012 08:51 AM, Xing, Steven wrote:
> Thanks a lot, Kaloyan.
> Could you give me more details about the way you mentioned for making the previous primary promote automatically even when the disk status is "Consistent", even though it is not safe.
> Do I need write some script or just need change some drbd settings?
> Thanks again. That would be very helpful.

Elaborating on what Kaloyan said;

You will be running the risk of a split-brain situation, which can lead
to data loss. It is highly ill-advised to automate the promotion of a
Consistent node to UpToDate. It is far wiser to avoid the situation in
the first place.

The problem is that, in clustering, there is an idea that "the only
thing you don't know is what you don't know." When the old primary
recovers, it cannot know what happened to its peer during the time it
was offline.

As Kaloyan said, if the secondary had been promoted to primary, then the
old primary will have a stale view of the data. If you force it to
UpToDate and start writing to it, while during the outage the backup had
already been made primary and had data written to it, you now have good
data on both nodes that is out of sync. The only way to recover is to
discard the changes on one of the nodes; hence, data loss.

With this said;

If the DRBD resource is part of a proper cluster, like pacemaker or
rhcs, then you can tie DRBD's fencing into the cluster using
crm-fence-peer.sh on pacemaker, or obliterate-peer.sh/rhcs_fence on
rhcs. Set up DRBD as a resource of the cluster, and then set the
cluster to fence its peer when it starts, if the peer doesn't respond
when the cluster starts (assuming a 2-node cluster).
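On pacemaker, tying DRBD's fencing into the cluster looks roughly like
the following drbd.conf fragment. This is a sketch for DRBD 8.x; the
resource name "r0" is a placeholder, and the script paths are the usual
install locations, so adjust for your distribution:

```
resource r0 {
  disk {
    # Freeze I/O and call the fence-peer handler when the
    # replication link to the peer is lost.
    fencing resource-and-stonith;
  }
  handlers {
    # crm-fence-peer.sh adds a pacemaker constraint that blocks
    # promotion of the (possibly outdated) peer; the unfence
    # handler removes it once resync completes.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

On rhcs you would point fence-peer at obliterate-peer.sh or rhcs_fence
instead.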

This will work because the cluster will start and (power) fence the
peer. Assuming the peer isn't dead, just off, the peer should boot up.
The node will then start DRBD, which will wait for its peer. Meanwhile,
the peer is booting and should come online, join the cluster and start
DRBD. As soon as it does, the old Primary will know that it really is
UpToDate and start up safely.

If you want to force the issue though, you can use 'wfc-timeout 300',
which tells DRBD to wait up to 5 minutes for its peer at startup; after
that time, it will stop waiting and proceed on its own. Please don't
use this until you've exhausted all other ways of starting safely.
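If you do go that route, the timeout lives in the startup section of
drbd.conf. A minimal sketch (again, "r0" is a placeholder resource
name, and degr-wfc-timeout is a related option worth knowing about, not
something mentioned above):

```
resource r0 {
  startup {
    # Wait up to 300 seconds for the peer at boot; after that the
    # init script stops waiting and continues on its own.
    wfc-timeout 300;
    # Wait a shorter time when the node was degraded (peer already
    # known to be absent) before it went down.
    degr-wfc-timeout 120;
  }
}
```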

-- 
Digimer
E-Mail:              digimer at alteeve.com
Papers and Projects: https://alteeve.com


