[DRBD-user] ocf:linbit:drbd: DRBD Split-Brain not detected in non standard setup

Dr. Volker Jaenisch volker.jaenisch at inqbus.de
Sat Feb 25 18:03:00 CET 2017



Hi All!

OMG I am so stupid!

* Feb 25 16:53:27 mail2 kernel: [11901363.518368] drbd r0: bind before
listen failed, err = -99

Without an interface to bind to, DRBD cannot listen!

@All: Never run a network-failure test with DRBD by shutting down
interfaces on your Linux box. Use iptables, pull the cables, or shut
down your Cisco switch ports instead.
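For the record, err = -99 is -EADDRNOTAVAIL: after "ifdown bond0" the replication address is no longer configured locally, so the kernel's bind() fails before DRBD ever gets to listen. A minimal sketch of the same failure with a plain socket (the TEST-NET address below stands in for the vanished local address; it is not from my setup):

```python
import errno
import socket

# 203.0.113.9 stands in for a local address that has disappeared
# (after "ifdown bond0" the host no longer owns 172.27.250.9).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.bind(("203.0.113.9", 7789))  # DRBD's replication port
    sock.listen(1)
except OSError as exc:
    # Errno 99 on Linux: "Cannot assign requested address"
    print(exc.errno == errno.EADDRNOTAVAIL)
finally:
    sock.close()
```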

So only one thing still stands out:
> But this is not what we like to happen in this case. In the case of
> communication breakdown of DRBD but still a connection between the
> corosync nodes, we would like the cluster nodes :
> 1) to remain in their state,
> 2) prevent DRBD from failover,
> 3) Indicate that the DRBD connection is broken
> 4) wait for reestablishing of the connection and resync the drbd after,
> 5) allow failover again.
From this fairly long wishlist, in our case:
1) Works
2) Works (the constraint rule prevents failover)
3) Does not work
4) Works
5) Works

I will open another thread for this last issue.

Many thanks to all of you. Sorry for stealing your time.

Cheers,
Volker

Just for completeness:

The main problem is still that DRBD does not recover once the
connection is restored:

> But this is not what we like to happen in this case. In the case of
> communication breakdown of DRBD but still a connection between the
> corosync nodes, we would like the cluster nodes :
> 1) to remain in their state,
> 2) prevent DRBD from failover,
> 3) Indicate that the DRBD connection is broken
> 4) wait for reestablishing of the connection and resync the drbd after,
> 5) allow failover again.
>
> From this fairly long wishlist in our case:
> 1) Works
> 2) Works (Rule prevents failover)
> 3) Works not
> 4) Works not
> 5) Works (after manually initiating the reconnect, since 4 does not
> work.)(Rule is removed)

In the log I noticed:
* the fence-peer handler /usr/lib/drbd/crm-fence-peer.sh is called twice
(PIDs 14189 and 14190)
* PID 14189 reports success:

    Feb 25 16:53:27 mail2 crm-fence-peer.sh[14189]: INFO peer is
reachable, my disk is UpToDate: placed constraint
'drbd-fence-by-handler-r0-ms_drbd_mail'

* But PID 14190 reports an error and returns with rc=1:

Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]: WARNING DATA INTEGRITY
at RISK: could not place the fencing constraint!

* Which causes the primary node to go into the StandAlone state.

Feb 25 16:53:27 mail2 kernel: [11901363.644874] drbd r0: helper command:
/sbin/drbdadm fence-peer r0 exit code 1 (0x100)
Feb 25 16:53:27 mail2 kernel: [11901363.644880] drbd r0: fence-peer
helper broken, returned 1
Feb 25 16:53:27 mail2 kernel: [11901363.651480] drbd r0: susp( 1 -> 0 )
Feb 25 16:53:28 mail2 kernel: [11901364.516041] drbd r0: State change
failed: Need a connection to start verify or resync
Feb 25 16:53:28 mail2 kernel: [11901364.516128] drbd r0:  mask = 0x1f0
val = 0x80
Feb 25 16:53:28 mail2 kernel: [11901364.516173] drbd r0: 
old_conn:StandAlone wanted_conn:WFConnection
Feb 25 16:53:28 mail2 kernel: [11901364.516225] drbd r0: receiver terminated
Feb 25 16:53:28 mail2 kernel: [11901364.516228] drbd r0: Terminating
drbd_r_r0
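Aside: the hex value the kernel prints next to each fence-peer result ("exit code 1 (0x100)" above, "exit code 4 (0x400)" in the full log below) is the raw wait status; the exit code is its high byte. A quick sketch of the decoding:

```python
import os

# The kernel log shows both forms: "exit code 4 (0x400)" and
# "exit code 1 (0x100)".  The hex value is the raw wait status;
# WEXITSTATUS extracts the exit code from its high byte.
for status in (0x400, 0x100):
    print(os.WEXITSTATUS(status))  # prints 4, then 1
```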

* In the StandAlone state the primary node no longer tries to connect to
its peer, and therefore no automatic reconnect/split-brain recovery occurs.

Questions:
1) Why is the handler called twice?
2) Why does the first call succeed and the second call fail? I assume it
is because the constraint is already set.
3) What is the cause of all this?
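My assumption on question 2 can be sketched: the CIB create operation is create-only, so when two fence-peer handlers race to place the same constraint id, the loser gets "Name not unique on network" (-76) even though the constraint it wanted is already in place. The handler PIDs, the constraint id, and the -76 code are taken from the log below; the in-memory CIB stand-in is hypothetical:

```python
# Hypothetical in-memory stand-in for the CIB's create-only semantics.
constraints = {}

def cib_create(constraint_id, body):
    """Create-only, like the cib_create operation: fail if the id exists."""
    if constraint_id in constraints:
        return -76  # "Name not unique on network"
    constraints[constraint_id] = body
    return 0

CONSTRAINT_ID = "drbd-fence-by-handler-r0-ms_drbd_mail"
RULE = "role=Master score=-INFINITY: #uname ne mail2"

print(cib_create(CONSTRAINT_ID, RULE))  # handler 14189: 0, constraint placed
print(cib_create(CONSTRAINT_ID, RULE))  # handler 14190: -76, id already exists
# A handler tolerant of this race would treat -76 as success, since the
# constraint it wanted to place is already there.
```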




Puzzled,

Volker

Here is the log after issuing an "ifdown bond0" on mail2:

Feb 25 16:53:27 mail2 kernel: [11901363.472044] drbd r0: PingAck did not
arrive in time.
Feb 25 16:53:27 mail2 kernel: [11901363.472123] drbd r0: peer( Secondary
-> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate ->
DUnknown ) susp( 0 -> 1 )
Feb 25 16:53:27 mail2 kernel: [11901363.518159] drbd r0: asender terminated
Feb 25 16:53:27 mail2 kernel: [11901363.518163] drbd r0: Terminating
drbd_a_r0
Feb 25 16:53:27 mail2 kernel: [11901363.518224] drbd r0: Connection closed
Feb 25 16:53:27 mail2 kernel: [11901363.518327] drbd r0: conn(
NetworkFailure -> Unconnected )
Feb 25 16:53:27 mail2 kernel: [11901363.518330] drbd r0: receiver terminated
Feb 25 16:53:27 mail2 kernel: [11901363.518332] drbd r0: Restarting
receiver thread
Feb 25 16:53:27 mail2 kernel: [11901363.518334] drbd r0: receiver
(re)started
Feb 25 16:53:27 mail2 kernel: [11901363.518338] drbd r0: helper command:
/sbin/drbdadm fence-peer r0
Feb 25 16:53:27 mail2 kernel: [11901363.518352] drbd r0: conn(
Unconnected -> WFConnection )
Feb 25 16:53:27 mail2 kernel: [11901363.518368] drbd r0: bind before
listen failed, err = -99
Feb 25 16:53:27 mail2 kernel: [11901363.518425] drbd r0: conn(
WFConnection -> Disconnecting )
Feb 25 16:53:27 mail2 kernel: [11901363.518481] drbd r0: Connection closed
Feb 25 16:53:27 mail2 kernel: [11901363.518559] drbd r0: conn(
Disconnecting -> StandAlone )
Feb 25 16:53:27 mail2 kernel: [11901363.518569] drbd r0: helper command:
/sbin/drbdadm fence-peer r0
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14189]: invoked for r0
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]: invoked for r0
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14189]: VJ VJ Peer state reachable
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]: VJ VJ Peer state reachable
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14189]: INFO peer is reachable,
my disk is UpToDate: placed constraint
'drbd-fence-by-handler-r0-ms_drbd_mail'
Feb 25 16:53:27 mail2 kernel: [11901363.629302] drbd r0: helper command:
/sbin/drbdadm fence-peer r0 exit code 4 (0x400)
Feb 25 16:53:27 mail2 kernel: [11901363.629309] drbd r0: fence-peer
helper returned 4 (peer was fenced)
Feb 25 16:53:27 mail2 kernel: [11901363.629332] drbd r0: pdsk( DUnknown
-> Outdated )
Feb 25 16:53:27 mail2 cib[12663]:  warning: Action cib_create failed:
Name not unique on network (cde=-76)
Feb 25 16:53:27 mail2 cib[12663]:    error: CIB Update failures   <failed>
Feb 25 16:53:27 mail2 cib[12663]:    error: CIB Update failures    
<failed_update id="drbd-fence-by-handler-r0-ms_drbd_mail"
object_type="rsc_location" operation="cib_create" reason="Name not
unique on network">
Feb 25 16:53:27 mail2 cib[12663]:    error: CIB Update failures      
<rsc_location rsc="ms_drbd_mail" id="drbd-fence-by-handler-r0-ms_drbd_mail">
Feb 25 16:53:27 mail2 cib[12663]:    error: CIB Update failures        
<rule role="Master" score="-INFINITY"
id="drbd-fence-by-handler-r0-rule-ms_drbd_mail">
Feb 25 16:53:27 mail2 cib[12663]:    error: CIB Update
failures           <expression attribute="#uname" operation="ne"
value="mail2" id="drbd-fence-by-handler-r0-expr-ms_drbd_mail"/>
Feb 25 16:53:27 mail2 cib[12663]:    error: CIB Update failures        
</rule>
Feb 25 16:53:27 mail2 cib[12663]:    error: CIB Update failures      
</rsc_location>
Feb 25 16:53:27 mail2 cib[12663]:    error: CIB Update failures    
</failed_update>
Feb 25 16:53:27 mail2 cib[12663]:    error: CIB Update failures   </failed>
Feb 25 16:53:27 mail2 cib[12663]:  warning: Completed cib_create
operation for section constraints: Name not unique on network (rc=-76,
origin=mail2/cibadmin/2, version=0.212.0)
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]: Call cib_create failed
(-76): Name not unique on network
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]: <failed>
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]:   <failed_update
id="drbd-fence-by-handler-r0-ms_drbd_mail" object_type="rsc_location"
operation="cib_create" reason="Name not unique on network">
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]:     <rsc_location
rsc="ms_drbd_mail" id="drbd-fence-by-handler-r0-ms_drbd_mail">
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]:   </failed_update>
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]: </failed>
Feb 25 16:53:27 mail2 kernel: [11901363.636902] block drbd1: new current
UUID 48E3C874737F59BD:10B8EB8C0D93D501:30EFCFCA3DAD5E37:30EECFCA3DAD5E37
Feb 25 16:53:27 mail2 crm-fence-peer.sh[14190]: WARNING DATA INTEGRITY
at RISK: could not place the fencing constraint!
Feb 25 16:53:27 mail2 kernel: [11901363.644874] drbd r0: helper command:
/sbin/drbdadm fence-peer r0 exit code 1 (0x100)
Feb 25 16:53:27 mail2 kernel: [11901363.644880] drbd r0: fence-peer
helper broken, returned 1
Feb 25 16:53:27 mail2 kernel: [11901363.651480] drbd r0: susp( 1 -> 0 )
Feb 25 16:53:28 mail2 kernel: [11901364.516041] drbd r0: State change
failed: Need a connection to start verify or resync
Feb 25 16:53:28 mail2 kernel: [11901364.516128] drbd r0:  mask = 0x1f0
val = 0x80
Feb 25 16:53:28 mail2 kernel: [11901364.516173] drbd r0: 
old_conn:StandAlone wanted_conn:WFConnection
Feb 25 16:53:28 mail2 kernel: [11901364.516225] drbd r0: receiver terminated
Feb 25 16:53:28 mail2 kernel: [11901364.516228] drbd r0: Terminating
drbd_r_r0
Feb 25 16:53:39 mail2 drbd(drbd_mail)[14265]: DEBUG: r0: Calling
/usr/sbin/crm_master -Q -l reboot -v 10000
Feb 25 16:53:39 mail2 drbd(drbd_mail)[14265]: DEBUG: r0: Exit code 0
Feb 25 16:53:39 mail2 drbd(drbd_mail)[14265]: DEBUG: r0: Command output:
Feb 25 16:53:54 mail2 drbd(drbd_mail)[14371]: DEBUG: r0: Calling
/usr/sbin/crm_master -Q -l reboot -v 10000
Feb 25 16:53:54 mail2 drbd(drbd_mail)[14371]: DEBUG: r0: Exit code 0
Feb 25 16:53:54 mail2 drbd(drbd_mail)[14371]: DEBUG: r0: Command output:
Feb 25 16:54:09 mail2 drbd(drbd_mail)[14464]: DEBUG: r0: Calling
/usr/sbin/crm_master -Q -l reboot -v 10000
Feb 25 16:54:09 mail2 drbd(drbd_mail)[14464]: DEBUG: r0: Exit code 0


resource r0 {
  disk {
    fencing resource-and-stonith;
  }

  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }

  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }

  on mail1 {
    device    /dev/drbd1;
    disk      /dev/sda1;
    address   172.27.250.8:7789;
    meta-disk internal;
  }
  on mail2 {
    device    /dev/drbd1;
    disk      /dev/sda1;
    address   172.27.250.9:7789;
    meta-disk internal;
  }
}

root at mail2:/home/volker# crm conf show
node 740030984: mail1 \
        attributes standby=off
node 740030985: mail2 \
        attributes standby=off
primitive Dovecot lsb:dovecot \
        op monitor interval=20s timeout=15s \
        meta target-role=Started
primitive drbd_mail ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=15s role=Master \
        op monitor interval=16s role=Slave \
        op start interval=0 timeout=240s \
        op stop interval=0 timeout=100s
primitive fs_mail Filesystem \
        params device="/dev/drbd/by-res/r0" directory="/shared/data"
fstype=ext4 run_fsck=no \
        meta target-role=Started \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s
primitive postgres_pg2 pgsql \
        op monitor interval=30 timeout=30 \
        op start interval=0 timeout=120s \
        op stop interval=0 timeout=120s \
        params pgport=5433 pgctl="/usr/lib/postgresql/9.4/bin/pg_ctl"
psql="/usr/bin/psql" pgdata="/shared/data/dovecot/pg2/data" pgdb=fill
pgdba=postgres config="/etc/postgresql/9.4/pg2/postgresql.conf" \
        meta target-role=Started
primitive vip_172.27.250.7 IPaddr2 \
        params ip=172.27.250.7 cidr_netmask=24 nic=bond0 iflabel=vip \
        op monitor interval=30s \
        meta target-role=Started
primitive vip_193.239.30.23 IPaddr2 \
        params ip=193.239.30.23 iflabel=vip \
        op monitor interval=30s \
        meta target-role=Started
group FS_IP fs_mail vip_193.239.30.23 vip_172.27.250.7
group Services postgres_pg2 Dovecot
ms ms_drbd_mail drbd_mail \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
notify=true is-managed=true target-role=Started
order FS_IP_after_drbd inf: ms_drbd_mail:promote FS_IP:start
order dovecot_after_FS_IP inf: FS_IP:start Services:start
location drbd-fence-by-handler-r0-ms_drbd_mail ms_drbd_mail \
        rule $role=Master -inf: #uname ne mail2
colocation mail_fs_on_drbd inf: FS_IP Services ms_drbd_mail:Master
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.15-e174ec8 \
        cluster-infrastructure=corosync \
        cluster-name=mail \
        stonith-enabled=false \
        last-lrm-refresh=1488034535 \
        no-quorum-policy=ignore


-- 
=========================================================
   inqbus Scientific Computing    Dr.  Volker Jaenisch
   Richard-Strauss-Straße 1       +49(08861) 690 474 0
   86956 Schongau-West            http://www.inqbus.de
=========================================================




