[DRBD-user] secundary not finish synchronizing [actually: automatic data loss by dual-primary, no-fencing, no cluster manager, and automatic after-split-brain recovery policy]

Tue Feb 13 15:12:57 CET 2018

On Mon, Feb 12, 2018 at 03:59:26PM -0600, Ricky Gutierrez wrote:
> 2018-02-09 4:40 GMT-06:00 Lars Ellenberg <lars.ellenberg at linbit.com>:
> > On Thu, Feb 08, 2018 at 02:52:10PM -0600, Ricky Gutierrez wrote:
> >> 2018-02-08 7:28 GMT-06:00 Lars Ellenberg <lars.ellenberg at linbit.com>:
> >> > And your config is?
> >>
> >> resource zimbradrbd {
> >
> >>         allow-two-primaries;
> >
> > Why dual primary?
> > I doubt you really need that.
> 
> I do not need it; Zimbra does not support active-active.

Don't add complexity you don't need.
Don't allow dual-primary if you *MUST* use the resource exclusively anyway.

> >>         after-sb-1pri discard-secondary;
> >
> > Here you tell it that,
> > if during a handshake after a cluster split brain
> > DRBD notices data divergence,
> > you want it to automatically resolve the situation
> > and discard all changes of the node that is Secondary
> > during that handshake, and overwrite it with the data
> > of the node that is Primary during that handshake.
> >
> >> become-primary-on both;
> >
> > And not even a cluster manager :-(
> 
> Here I forgot to mention, that for this function I am using pacemaker
> and corosync

Then don't tell the *init script* to promote DRBD.
If you are using a cluster manager,
controlling DRBD is the job of that cluster manager.
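As a rough sketch of what that means for the config quoted above
(assuming DRBD 8.4 syntax; the comments are illustrative and the
rest of the poster's resource definition is not shown in this thread):

```
resource zimbradrbd {
        # no "allow-two-primaries" in the net section:
        # Zimbra is active/passive, so only one node should ever be Primary.

        # no "become-primary-on" in the startup section:
        # promotion is left to Pacemaker, not to the init script.

        # ... rest of the resource definition unchanged ...
}
```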

> >> > And your logs say?
> >> Feb  5 13:45:29 node-01 kernel:
> >> drbd zimbradrbd: conn( Disconnecting -> StandAlone )
> >
> > That's the uninteresting part (the disconnect).
> > The interesting part is the connection handshake.
> >
> >> > As is, I can only take an (educated) wild guess:
> >> >
> >> > Do you have no (or improperly configured) fencing,
> >>
> >> i don't have.
> >
> > Too bad :-(
> 
> Is there some option to do it in software, and not in hardware?

There are DRBD "fencing policies" and handlers.
There is Pacemaker "node level fencing" (aka stonith).

To be able to avoid DRBD data divergence due to cluster split-brain,
you'd need both.  Stonith alone is not good enough, DRBD fencing
policies alone are not good enough. You need both.

If you absolutely refuse to use stonith, using at least
DRBD level fencing policies, combined with redundant cluster
communications, is better than using no fencing at all.
But without stonith, there will still be failure scenarios
(certain cluster split brain scenarios) that will result in data
divergence on DRBD.
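
A minimal sketch of the DRBD side, assuming DRBD 8.4 syntax (where
"fencing" lives in the disk section; in DRBD 9 it moved to the net
section) and the handler scripts shipped with the DRBD utilities at
their usual install path, which may differ on your distribution:

```
resource zimbradrbd {
        disk {
                # freeze I/O and call the fence-peer handler when the peer is lost;
                # meant to be combined with stonith in Pacemaker
                fencing resource-and-stonith;
        }
        handlers {
                # places a constraint in the Pacemaker CIB so the peer cannot be promoted
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                # removes that constraint again once the resync has finished
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        # ... rest of the resource definition unchanged ...
}
```

Without stonith, the corresponding policy would be "fencing resource-only;",
with the remaining failure scenarios described above. The Pacemaker
stonith configuration is a separate matter and not shown here.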

Data divergence is not necessarily better than data corruption:
you end up with two versions of data you cannot merge,
at least not in general.
With data corruption (which is the result of cluster split-brain on a
shared disk, without fencing), you at least go straight to your backup;
with a replicated disk, and data divergence, you may first waste some time
trying to merge the data sets, before going for the backups anyway :-/

Without "auto-recovery" strategies configured, you at least
get to decide yourself which version to throw away.
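
For illustration, these are the net-section settings that leave the
decision to you, assuming DRBD 8.4 syntax. "disconnect" should also be
the default for all three, so simply removing the "after-sb-*" lines
from the config ought to have the same effect:

```
resource zimbradrbd {
        net {
                # on detected data divergence: do not auto-resolve, just disconnect
                after-sb-0pri disconnect;
                after-sb-1pri disconnect;
                after-sb-2pri disconnect;
        }
        # ... rest of the resource definition unchanged ...
}
```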

DRBD allows you to "go without fencing", but that's just because there
are people who value "being online with some data" (which is potentially
outdated, but at least consistent), over "rather offline when in doubt".

That does not make it a good idea in general, though.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed