[DRBD-user] ocf:linbit:drbd: DRBD Split-Brain not detected in non standard setup

Dr. Volker Jaenisch volker.jaenisch at inqbus.de
Fri Feb 24 17:32:05 CET 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Servus !

On 24.02.2017 at 15:53, Lars Ellenberg wrote:
> On Fri, Feb 24, 2017 at 03:08:04PM +0100, Dr. Volker Jaenisch wrote:
>> If both 10Gbit links fail then the bond0 aka the worker connection fails
>> and DRBD goes - as expected - into split brain. But that is not the problem.
>
> DRBD will be *disconnected*, yes.
Sorry, I was not precise in my wording. I assumed that after going into
the disconnected state, the cluster manager is informed and reflects
this somehow.
I have now noticed that a CIB rule is set on the former primary to stay
primary (please have a look at the cluster state at the end of this
email), but I still wonder why this is not reflected in the crm status.
I was misled by this missing status information and wrongly concluded
that the ocf:linbit:drbd agent does not inform the CRM/CIB. Sorry for
blaming DRBD.
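
For reference, the rule can be seen with something like the following
(the id prefix drbd-fence-by-handler- is the one the handler uses here;
querying the constraints section of the CIB directly works as well):

mail2# crm configure show | grep -A1 drbd-fence-by-handler
mail2# cibadmin --query --scope constraints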

But I am still confused that Pacemaker does not reflect the DRBD state
change in the crm status. Maybe this question should go to the
Pacemaker list.
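
As far as I understand it, the agent reports its state to Pacemaker
mostly via master score node attributes (plus the fence constraint the
handler adds), and neither shows up in the plain resource list of
crm status. The attributes can at least be made visible with

mail2# crm_mon -A1

where -A shows the node attributes (e.g. master-drbd_mail) and -1
prints the status once and exits.
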
> But no reason for it to be "split brain"ed yet.
> and with proper fencing configured, it won't.
This is our DRBD config; it is all quite basic:

resource r0 {

  disk {
    fencing resource-only;
  }

  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }

  on mail1 {
    device    /dev/drbd1;
    disk      /dev/sda1;
    address   172.27.250.8:7789;
    meta-disk internal;
  }
  on mail2 {
    device    /dev/drbd1;
    disk      /dev/sda1;
    address   172.27.250.9:7789;
    meta-disk internal;
  }
}
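
If I read Lars' remark about "proper fencing" correctly, the stricter
variant of the above would be roughly this sketch (resource-and-stonith
freezes I/O on the disconnected primary until the fence-peer handler
returns, and it additionally requires working node-level fencing /
STONITH in the cluster):

resource r0 {

  disk {
    # freeze I/O on connection loss until the fence-peer handler returns;
    # needs a real STONITH device configured in Pacemaker
    fencing resource-and-stonith;
  }

  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }

  # on mail1 { ... } / on mail2 { ... } sections unchanged
}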

*What did we miss?* We have no STONITH configured yet, and IMHO a
missing STONITH configuration should not interfere with the DRBD state
change. Or am I wrong with this assumption?
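
For completeness, enabling STONITH would mean something roughly like
the following in the crm shell. The agent and every parameter below are
only placeholders for illustration; they depend entirely on the fencing
hardware (IPMI boards in this sketch):

# one fence device per node; agent name and params are placeholders
primitive fence_mail1 stonith:external/ipmi \
        params hostname=mail1 ipaddr=10.0.0.101 userid=admin passwd=secret interface=lanplus \
        op monitor interval=60s
primitive fence_mail2 stonith:external/ipmi \
        params hostname=mail2 ipaddr=10.0.0.102 userid=admin passwd=secret interface=lanplus \
        op monitor interval=60s
# a fence device should not run on the node it is supposed to shoot
location l_fence_mail1 fence_mail1 -inf: mail1
location l_fence_mail2 fence_mail2 -inf: mail2
property stonith-enabled=true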


State after bond0 goes down:

root at mail1:/home/volker# crm status
Stack: corosync
Current DC: mail2 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Fri Feb 24 16:56:44 2017          Last change: Fri Feb 24 16:45:19 2017 by root via cibadmin on mail2

2 nodes and 7 resources configured

Online: [ mail1 mail2 ]

Full list of resources:

 Master/Slave Set: ms_drbd_mail [drbd_mail]
     Masters: [ mail2 ]
     Slaves: [ mail1 ]
 Resource Group: FS_IP
     fs_mail    (ocf::heartbeat:Filesystem):    Started mail2
     vip_193.239.30.23  (ocf::heartbeat:IPaddr2):       Started mail2
     vip_172.27.250.7   (ocf::heartbeat:IPaddr2):       Started mail2
 Resource Group: Services
     postgres_pg2       (ocf::heartbeat:pgsql): Started mail2
     Dovecot    (lsb:dovecot):  Started mail2

Failed Actions:
* vip_172.27.250.7_monitor_30000 on mail2 'not running' (7): call=55, status=complete, exitreason='none',
    last-rc-change='Fri Feb 24 16:47:07 2017', queued=0ms, exec=0ms

root at mail2:/home/volker# drbd-overview
 1:r0/0  StandAlone Primary/Unknown UpToDate/Outdated /shared/data ext4 916G 12G 858G 2%

root at mail1:/home/volker# drbd-overview
 1:r0/0  WFConnection Secondary/Unknown UpToDate/DUnknown


And after bringing bond0 up again, the same state remains on both machines.
After a cleanup of the failed VIP resource, still the same state:

root at mail2:/home/volker# crm status
Stack: corosync
Current DC: mail2 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Fri Feb 24 17:01:05 2017          Last change: Fri Feb 24 16:59:32 2017 by hacluster via crmd on mail2

2 nodes and 7 resources configured

Online: [ mail1 mail2 ]

Full list of resources:

 Master/Slave Set: ms_drbd_mail [drbd_mail]
     Masters: [ mail2 ]
     Slaves: [ mail1 ]
 Resource Group: FS_IP
     fs_mail    (ocf::heartbeat:Filesystem):    Started mail2
     vip_193.239.30.23  (ocf::heartbeat:IPaddr2):       Started mail2
     vip_172.27.250.7   (ocf::heartbeat:IPaddr2):       Started mail2
 Resource Group: Services
     postgres_pg2       (ocf::heartbeat:pgsql): Started mail2
     Dovecot    (lsb:dovecot):  Started mail2

root at mail2:/home/volker# drbd-overview
 1:r0/0  StandAlone Primary/Unknown UpToDate/Outdated /shared/data ext4 916G 12G 858G 2%

root at mail1:/home/volker# drbd-overview
 1:r0/0  WFConnection Secondary/Unknown UpToDate/DUnknown

After issuing a

mail2# drbdadm connect all

the nodes resync and everything is back in order (the "sticky" location
rule is cleared as well).
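
Just as a note: had this been a real split brain (diverged data on both
sides), DRBD would refuse a plain connect and the usual manual recovery
would be needed, roughly like this (DRBD 8.4 syntax; the "victim" is
the node whose changes get thrown away):

victim# drbdadm disconnect r0
victim# drbdadm secondary r0
victim# drbdadm connect --discard-my-data r0

# on the surviving node, only if it is StandAlone:
survivor# drbdadm connect r0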

Cheers,

Volker

General setup: stock Debian Jessie without any modifications; DRBD,
Pacemaker etc. are all from Debian packages.

Here our crm config:

node 740030984: mail1 \
        attributes standby=off
node 740030985: mail2 \
        attributes standby=off
primitive Dovecot lsb:dovecot \
        op monitor interval=20s timeout=15s \
        meta target-role=Started
primitive drbd_mail ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=15s role=Master \
        op monitor interval=16s role=Slave \
        op start interval=0 timeout=240s \
        op stop interval=0 timeout=100s
...
ms ms_drbd_mail drbd_mail \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true is-managed=true target-role=Started
order FS_IP_after_drbd inf: ms_drbd_mail:promote FS_IP:start
order dovecot_after_FS_IP inf: FS_IP:start Services:start
location drbd-fence-by-handler-r0-ms_drbd_mail ms_drbd_mail \
        *rule $role=Master -inf: #uname ne mail2*
colocation mail_fs_on_drbd inf: FS_IP Services ms_drbd_mail:Master
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.15-e174ec8 \
        cluster-infrastructure=corosync \
        cluster-name=mail \
        stonith-enabled=false \
        last-lrm-refresh=1487951972 \
        no-quorum-policy=ignore
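
(Should the unfence handler ever fail to remove the
drbd-fence-by-handler rule after a resync, it can also be deleted by
its id with the generic crm shell command, e.g.

mail2# crm configure delete drbd-fence-by-handler-r0-ms_drbd_mail

but in our test it was cleaned up automatically.)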

-- 
=========================================================
   inqbus Scientific Computing    Dr.  Volker Jaenisch
   Richard-Strauss-Straße 1       +49(08861) 690 474 0
   86956 Schongau-West            http://www.inqbus.de
=========================================================
