[DRBD-user] secondary not finish synchronizing [actually: automatic data loss by dual-primary, no-fencing, no cluster manager, and automatic after-split-brain recovery policy]

Mon Feb 12 22:59:26 CET 2018

2018-02-09 4:40 GMT-06:00 Lars Ellenberg <lars.ellenberg at linbit.com>:
> On Thu, Feb 08, 2018 at 02:52:10PM -0600, Ricky Gutierrez wrote:
>> 2018-02-08 7:28 GMT-06:00 Lars Ellenberg <lars.ellenberg at linbit.com>:
>> > And your config is?
>>
>> resource zimbradrbd {
>
>>         allow-two-primaries;
>
> Why dual primary?
> I doubt you really need that.

I do not need it; Zimbra does not support active-active.

>
>>         after-sb-1pri discard-secondary;
>
> Here you tell it that,
> if during a handshake after a cluster split brain
> DRBD notices data divergence,
> you want it to automatically resolve the situation
> and discard all changes of the node that is Secondary
> during that handshake, and overwrite it with the data
> of the node that is Primary during that handshake.
>
>> become-primary-on both;
>
> And not even a cluster manager :-(

I forgot to mention that I am using Pacemaker and Corosync
for this function.

>
>> > And your logs say?
>> Feb  5 13:45:29 node-01 kernel:
>> drbd zimbradrbd: conn( Disconnecting -> StandAlone )
>
> That's the uninteresting part (the disconnect).
> The interesting part is the connection handshake.
>
>> > As is, I can only take an (educated) wild guess:
>> >
>> > Do you have no (or improperly configured) fencing,
>>
>> I don't have any.
>
> Too bad :-(

Is there some option to do fencing in software rather than in hardware?

>
>> > and funky auto-resolve policies configured,
>> > after-split-brain ... discard-secondary maybe?
>>
>> > Configuring automatic after-split-brain recovery policies
>> > is configuring automatic data loss.
>> >
>> > So if you don't mean that, don't do it.
>>
>> I am not an expert in DRBD, but with the configuration that I have,
>> according to the documentation,
>> this situation should not happen, or am I wrong?
>
> Unconditional dual primary directly by init script,
>  no fencing,
>  no cluster manager,
> and policies to automatically choose a victim after data divergence.
>
> You absolutely asked for that situation to happen.
>
> You told DRBD to
> go primary on startup
> ignoring the peer (no cluster manager, no fencing policies)
> and in case it detects data divergence, automatically throw away
> changes on whatever node may be Secondary at that point in time.
> Which is likely the node most recently rebooted.
>
> And that is what happened:
> network disconnected,
> you had both nodes Primary, both could (and likely did)
> keep changing their version of the data.
>
> Some time later, you rebooted (or deconfigured and reconfigured)
> DRBD on at least one node, and during the next DRBD handshake,
> DRBD noticed that both datasets have been changed independently,
> would have refused to connect by default, but you told it to
> automatically resolve that data conflict and discard changes on
> whatever node was Secondary during that handshake.
>
> Apparently the node with the "more interesting" (to you)
> data has been Secondary during that handshake,
> while the other with the "less interesting" (to you) data
> already? still? has been Primary.
>
> And since you have configured DRBD for automatic data loss,
> you got automatic data loss.
>
> Note: those *example* settings for the auto-resolve policies
> may be ok *IF* you have proper fencing enabled.
>
> With fencing, there won't be data divergence, really,
> but DRBD would freeze IO when it first disconnects,
> the (or both) Primaries (at that time) will call the fence handler,
> the fence handler is expected to hard-reset the peer,
> the "winner" of that shoot-out will resume IO,
> the victim may or may not boot back up,
> NOT promote before it can see the peer,
> and during that next handshake obviously be Secondary.
>
> On the off chance that DRBD still thinks there had been data divergence,
> you may want to help it by automatically throwing away the (hopefully
> useless) changes on the then Secondary in this scenario.
> Because if the fencing did work, the shot node should just be
> "linearly older" than the survivor, and DRBD should do a normal resync.
> Which is why I would probably NOT want to add those settings,
> but in case there is a data divergence detected, rather
> investigate what went wrong, and make a conscious decision.
>
> But it is an option DRBD offers to those that want to enable it,
> and may make sense in some setups.  Mostly for those that value
> "online" over "most recent data", and can not afford
> "administrative intervention" for whatever reason.
>
> Maybe we should add a big red stay-out-here-be-dragons sign
> to both "allow-two-primaries" and the after-split-brain
> auto-recovery policy settings in the guide.

For the moment I have left my configuration as follows. I think
it is the best scenario, even if I have to intervene manually.

resource zimbradrbd {
    protocol C;
    # DRBD device
    device /dev/drbd0;
    # block device
    disk /dev/hamail/opt;
    meta-disk internal;

    syncer {
        rate 40M;
    }

    on node-01.domain.nl {
        # IP address:port
        address 192.168.20.91:7788;
    }
    on node-02.domain.nl {
        address 192.168.20.92:7788;
    }
}
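
A minimal sketch of how the fencing described above might be added to
this resource later. It assumes DRBD 8.x syntax (as the syncer section
suggests) and the Pacemaker/Corosync stack mentioned earlier. The handler
paths are the usual locations of the scripts shipped with DRBD and are an
assumption here; they may differ per distribution. It also assumes
working STONITH in Pacemaker, which is not configured yet.

resource zimbradrbd {
    disk {
        # freeze IO on loss of the peer and call the fence-peer handler
        fencing resource-and-stonith;
    }
    handlers {
        # assumed paths, check where your packages install them
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # protocol, device, disk, meta-disk, syncer and "on" sections as above
}

There is no allow-two-primaries and no after-sb-* policy here, so if DRBD
detects data divergence it refuses to connect and waits for a manual
decision.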

-- 
rickygm

http://gnuforever.homelinux.com