[DRBD-user] ocf:linbit:drbd: DRBD Split-Brain not detected in non standard setup

Igor Cicimov igorc at encompasscorporation.com
Sat Feb 25 08:32:27 CET 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On 25 Feb 2017 3:32 am, "Dr. Volker Jaenisch" <volker.jaenisch at inqbus.de>
wrote:

Hi!
On 24.02.2017 at 15:53, Lars Ellenberg wrote:

On Fri, Feb 24, 2017 at 03:08:04PM +0100, Dr. Volker Jaenisch wrote:

If both 10Gbit links fail, then bond0 (aka the worker connection) fails
and DRBD goes - as expected - into split brain. But that is not the problem.

DRBD will be *disconnected*, yes.

Sorry, I was not precise in my wording. But I assumed that after going into
the disconnected state the cluster manager would be informed and reflect this
somehow. I now noticed that a CIB rule is set on the former primary to stay
primary (please have a look at the cluster state at the end of this email),
but I still wonder why this is not reflected in the crm status. I was misled
by this missing status information and wrongly concluded that the
ocf:linbit:drbd agent does not inform the CRM/CIB. Sorry for blaming drbd.

But I am still confused that Pacemaker does not reflect the DRBD state change
in the crm status. Maybe this question should go to the Pacemaker list.
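
As an aside, the fence constraint can be seen directly in the CIB even when
crm status does not show it; roughly, using standard crm shell / cibadmin
queries:

mail2# crm configure show | grep -A1 drbd-fence-by-handler
mail2# cibadmin --query --xpath "//rsc_location[contains(@id,'drbd-fence-by-handler')]"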

But there is no reason for it to be "split-brained" yet,
and with proper fencing configured, it won't be.

This is our DRBD config. This is all quite basic:

resource r0 {

  disk {
    fencing resource-only;
  }


This needs to be:

fencing resource-and-stonith;

if you wish drbd to tell crm to take any action.


  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }

  on mail1 {
    device    /dev/drbd1;
    disk      /dev/sda1;
    address   172.27.250.8:7789;
    meta-disk internal;
  }
  on mail2 {
    device    /dev/drbd1;
    disk      /dev/sda1;
    address   172.27.250.9:7789;
    meta-disk internal;
  }
}
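
For reference, a minimal sketch of the disk section with the policy Lars
suggests; note that resource-and-stonith assumes a working STONITH setup on
the Pacemaker side, which we do not have yet:

  disk {
    # freeze I/O and call the fence-peer handler when the peer is lost
    fencing resource-and-stonith;
  }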

What did we miss? We have no STONITH configured yet, and IMHO a missing
STONITH configuration should not interfere with the DRBD state change. Or
am I wrong with this assumption?
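
For the record, a rough sketch of what adding STONITH could look like in the
crm config; the external/ipmi agent and every parameter value below are
placeholders rather than values from this setup:

primitive fence_mail1 stonith:external/ipmi \
        params hostname=mail1 ipaddr=10.0.0.1 userid=admin passwd=secret interface=lan \
        op monitor interval=60s
primitive fence_mail2 stonith:external/ipmi \
        params hostname=mail2 ipaddr=10.0.0.2 userid=admin passwd=secret interface=lan \
        op monitor interval=60s
location l_fence_mail1 fence_mail1 -inf: mail1
location l_fence_mail2 fence_mail2 -inf: mail2

plus stonith-enabled=true in the cluster properties.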


State after bond0 goes down:

root@mail1:/home/volker# crm status
Stack: corosync
Current DC: mail2 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Fri Feb 24 16:56:44 2017          Last change: Fri Feb 24 16:45:19 2017 by root via cibadmin on mail2

2 nodes and 7 resources configured

Online: [ mail1 mail2 ]

Full list of resources:

 Master/Slave Set: ms_drbd_mail [drbd_mail]
     Masters: [ mail2 ]
     Slaves: [ mail1 ]
 Resource Group: FS_IP
     fs_mail    (ocf::heartbeat:Filesystem):    Started mail2
     vip_193.239.30.23  (ocf::heartbeat:IPaddr2):       Started mail2
     vip_172.27.250.7   (ocf::heartbeat:IPaddr2):       Started mail2
 Resource Group: Services
     postgres_pg2       (ocf::heartbeat:pgsql): Started mail2
     Dovecot    (lsb:dovecot):  Started mail2

Failed Actions:
* vip_172.27.250.7_monitor_30000 on mail2 'not running' (7): call=55, status=complete, exitreason='none',
    last-rc-change='Fri Feb 24 16:47:07 2017', queued=0ms, exec=0ms

root@mail2:/home/volker# drbd-overview
 1:r0/0  StandAlone Primary/Unknown UpToDate/Outdated /shared/data ext4 916G 12G 858G 2%

root@mail1:/home/volker# drbd-overview
 1:r0/0  WFConnection Secondary/Unknown UpToDate/DUnknown


And after bringing bond0 up again, the same state remains on both machines.
After a cleanup of the failed VIP resource, still the same state:

root@mail2:/home/volker# crm status
Stack: corosync
Current DC: mail2 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Fri Feb 24 17:01:05 2017          Last change: Fri Feb 24 16:59:32 2017 by hacluster via crmd on mail2

2 nodes and 7 resources configured

Online: [ mail1 mail2 ]

Full list of resources:

 Master/Slave Set: ms_drbd_mail [drbd_mail]
     Masters: [ mail2 ]
     Slaves: [ mail1 ]
 Resource Group: FS_IP
     fs_mail    (ocf::heartbeat:Filesystem):    Started mail2
     vip_193.239.30.23  (ocf::heartbeat:IPaddr2):       Started mail2
     vip_172.27.250.7   (ocf::heartbeat:IPaddr2):       Started mail2
 Resource Group: Services
     postgres_pg2       (ocf::heartbeat:pgsql): Started mail2
     Dovecot    (lsb:dovecot):  Started mail2

root@mail2:/home/volker# drbd-overview
 1:r0/0  StandAlone Primary/Unknown UpToDate/Outdated /shared/data ext4 916G 12G 858G 2%

root@mail1:/home/volker# drbd-overview
 1:r0/0  WFConnection Secondary/Unknown UpToDate/DUnknown

After issuing a

mail2# drbdadm connect all

the nodes resync and everything is back in order (the "sticky" fence rule is
cleared as well).
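
Spelled out, the manual recovery is roughly the following (a sketch; the
constraint name is the one from the crm config below, and deleting it by hand
is only needed if the after-resync-target handler does not clear it):

mail2# drbdadm connect all       # leave StandAlone and reconnect to the peer
mail1# drbdadm cstate r0         # should go WFConnection -> SyncTarget -> Connected
mail2# drbdadm cstate r0         # should end up Connected once the resync finishes
mail2# crm configure delete drbd-fence-by-handler-r0-ms_drbd_mail   # only if the constraint was left behind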

Cheers,

Volker

General setup: stock Debian Jessie without any modifications; DRBD,
Pacemaker, etc. are all from Debian packages.

Here is our crm config:

node 740030984: mail1 \
        attributes standby=off
node 740030985: mail2 \
        attributes standby=off
primitive Dovecot lsb:dovecot \
        op monitor interval=20s timeout=15s \
        meta target-role=Started
primitive drbd_mail ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=15s role=Master \
        op monitor interval=16s role=Slave \
        op start interval=0 timeout=240s \
        op stop interval=0 timeout=100s
...
ms ms_drbd_mail drbd_mail \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true is-managed=true target-role=Started
order FS_IP_after_drbd inf: ms_drbd_mail:promote FS_IP:start
order dovecot_after_FS_IP inf: FS_IP:start Services:start
location drbd-fence-by-handler-r0-ms_drbd_mail ms_drbd_mail \
        rule $role=Master -inf: #uname ne mail2
colocation mail_fs_on_drbd inf: FS_IP Services ms_drbd_mail:Master
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.15-e174ec8 \
        cluster-infrastructure=corosync \
        cluster-name=mail \
        stonith-enabled=false \
        last-lrm-refresh=1487951972 \
        no-quorum-policy=ignore






-- 
=========================================================
   inqbus Scientific Computing    Dr.  Volker Jaenisch
   Richard-Strauss-Straße 1       +49(08861) 690 474 0
   86956 Schongau-West            http://www.inqbus.de
=========================================================

