On 01/13/2012 04:59 AM, Luis M. Carril wrote:
> Hello,
>
> I'm new to DRBD and I think that I have a mess with some concepts and
> policies.

Welcome! DRBD is a bit different from many storage concepts, so it takes a bit to wrap your head around. However, be careful not to overthink things... it's fundamentally quite straightforward.

> I have set up a two node cluster (of virtual machines) with a shared
> volume in dual-primary mode with ocfs2 as a basic infrastructure for
> some testing.

Do you have fencing? Dual-primary cannot operate safely without a mechanism for ensuring the state of the remote node.

> I need that when one of the two nodes goes down the other continues
> working normally (we can assume that the other node never will recover
> again), but when one node fails

The assumption that the other node will never return is not one that DRBD can make. This is where fencing comes in... When a node loses contact with its peer, it has no way of knowing what state the remote node is in. Is it still running, but convinced that the local node is gone? Is the silent node hung, but liable to return at some point? Is the remote node powered off? The only thing you know is what you don't know.

Consider: both nodes, had they simply assumed "silence == death", go StandAlone and Primary. During this time, data is written to either node but is not replicated. Now you have divergent data, and the only way to recover is to invalidate the changes on one of the nodes. Data loss.

The solution is fencing plus resource management, which is what Andreas meant when he asked about Pacemaker vs. rgmanager.

> the other enters the WFConnection state and the volume is disconnected.
> I have set up the standard set of policies for split brain:
>
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri disconnect;
>
> Which policy should I use to achieve the desired behaviour (if one
> node fails, the other continues working alone)?
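(As an aside on the fencing point above: in DRBD 8.x, enabling fencing usually looks something like the snippet below in the resource definition. This is a sketch only; the handler paths assume a Pacemaker cluster using the crm-fence-peer.sh script shipped with DRBD, and may differ on your distro.)

  resource r0 {
    disk {
      # Freeze I/O and call the fence-peer handler when contact
      # with the peer is lost
      fencing resource-and-stonith;
    }
    handlers {
      # Helper scripts shipped with DRBD for Pacemaker clusters;
      # they place/remove a location constraint on the peer
      fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
  }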
> Regards

Again, as Andreas indicated, those settings control the policy applied when comms are lost (be it because of a network error, the peer dying or hanging, whatever). It is by design that a node, after losing its peer, goes into WFConnection (waiting for connection). In this state, if/when the peer recovers (as it often does with power fencing), it can re-establish the connection, sync the changes and return to a normal operating state.

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"omg my singularity battery is dead again. stupid hawking radiation." - epitron