[DRBD-user] Pacemaker/drbd split-brain after reboot only if filesystem resource is active

Digimer lists at alteeve.ca
Sat Nov 7 22:38:57 CET 2015



A storage cluster without fencing begs for a split-brain.

Please set up and test stonith first. Make sure it is working; then, if
your issue remains, you can start to sanely debug it.

Set up stonith in Pacemaker, then configure DRBD to use 'fencing
resource-and-stonith' together with the crm-fence-peer.sh and
crm-unfence-peer.sh handler scripts.
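
For reference, here is a rough sketch of what the DRBD side might look like.
It assumes DRBD 8.4 syntax (your logs show protocol version 101, which
suggests 8.4) and that the handler scripts are installed under
/usr/lib/drbd/, which is an assumption; check the path on your distribution.
The resource name "shared" is taken from your Pacemaker config. On the
Pacemaker side, stonith-enabled (currently "false" in your config) has to be
"true" with a working, tested stonith resource; that part depends on your
fence hardware, so it is not shown here.

```
# Fragment of the DRBD resource definition (DRBD 8.4 syntax assumed).
# Only the fencing-related settings are shown; the rest of the resource
# (devices, disks, addresses) stays as it is.
resource shared {
  disk {
    # Freeze I/O and call the fence-peer handler when the peer is lost.
    fencing resource-and-stonith;
  }
  handlers {
    # Script paths are assumed; adjust them to where your packages install them.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

With this in place, the fence-peer handler adds a constraint to the CIB that
keeps the outdated peer from being promoted, and the unfence handler removes
it again once the resync has finished.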

digimer

On 06/11/15 10:27 AM, Philipp Dittert wrote:
> Hi,
> 
> I want to build an HA NFS storage with Pacemaker and DRBD, but I am running
> into some problems which I couldn't solve by myself, so I need your help.
> 
> 
> I've got a five-node cluster where 2 nodes share a DRBD volume
> (Primary/Secondary). The cluster is configured as an "opt-in" cluster, so I
> need to define which resources can run on each node. A Pacemaker location
> constraint is used to set a specific node (node4) as my primary DRBD
> node. All is working as expected, e.g. if I move the DRBD resource via
> "crm resource migrate ms_drbd node3" it is correctly promoted to
> master. If node4 is the primary DRBD node and I reboot the machine, the
> resource is moved to node3. After node4 is back, the resource is
> migrated back to it. DRBD is in sync and all is good. If I add a
> filesystem resource to handle the DRBD mount and then reboot node4,
> the sync fails at some point after node4 is online again and I get a
> split brain.
> 
> 
> Here is my pacemaker config:
> 
> node $id="167798047" lb
> node $id="167798048" node1
> node $id="167798049" node2
> node $id="167798050" node3 \
> attributes standby="off"
> node $id="167798051" node4 \
> attributes standby="off"
> primitive drbd ocf:linbit:drbd \
> params drbd_resource="shared" \
> op monitor interval="29s" role="Master" \
> op monitor interval="31s" role="Slave"
> primitive p_fs-drbd ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/shared" fstype="ext4" \
> op monitor timeout="40" interval="20" depth="0" \
> meta target-role="Started"
> ms ms_drbd drbd \
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> location cli-prefer-ms_drbd ms_drbd inf: node4
> location location-ms_drbd-node3-200 ms_drbd 200: node3
> location location-ms_drbd-node4-0 ms_drbd 0: node4
> location location-p_fs-drbd-node3-0 p_fs-drbd 0: node3
> location location-p_fs-drbd-node4-0 p_fs-drbd 0: node4
> colocation fs_on_drbd inf: p_fs-drbd ms_drbd:Master
> order mount_after_drbd inf: ms_drbd:promote p_fs-drbd:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.10-42f2063" \
> cluster-infrastructure="corosync" \
> symmetric-cluster="false" \
> stonith-enabled="false" \
> last-lrm-refresh="1446820125"
> 
> 
> 
> Syslog on node3 WITHOUT filesystem resource:
> 
> 
> Nov  6 15:25:02 node3 corosync[1194]:   [QUORUM] Members[5]: 167798047
> 167798048 167798049 167798050 167798051
> Nov  6 15:25:02 node3 corosync[1194]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Nov  6 15:25:02 node3 crmd[1907]:   notice: crm_update_peer_state:
> pcmk_quorum_notification: Node node4[167798051] - state is now member
> (was lost)
> Nov  6 15:25:02 node3 pacemakerd[1900]:   notice: crm_update_peer_state:
> pcmk_quorum_notification: Node node4[167798051] - state is now member
> (was lost)
> Nov  6 15:25:44 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=533, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:25:44 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=536, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:25:44 node3 kernel: [80590.188119] drbd shared: Handshake
> successful: Agreed network protocol version 101
> Nov  6 15:25:44 node3 kernel: [80590.188124] drbd shared: Agreed to
> support TRIM on protocol level
> Nov  6 15:25:44 node3 kernel: [80590.188161] drbd shared: conn(
> WFConnection -> WFReportParams ) 
> Nov  6 15:25:44 node3 kernel: [80590.188164] drbd shared: Starting
> asender thread (from drbd_r_shared [12796])
> Nov  6 15:25:44 node3 kernel: [80590.208166] block drbd0:
> drbd_sync_handshake:
> Nov  6 15:25:44 node3 kernel: [80590.208173] block drbd0: self
> D2B7CD09671C54D5:B1B9AE133EA69D76:9E43F7F6666358F4:9E42F7F6666358F4
> bits:0 flags:0
> Nov  6 15:25:44 node3 kernel: [80590.208178] block drbd0: peer
> B1B9AE133EA69D76:0000000000000000:9E43F7F6666358F4:9E42F7F6666358F4
> bits:0 flags:2
> Nov  6 15:25:44 node3 kernel: [80590.208181] block drbd0:
> uuid_compare()=1 by rule 70
> Nov  6 15:25:44 node3 kernel: [80590.208191] block drbd0: peer( Unknown
> -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown ->
> Consistent ) 
> Nov  6 15:25:44 node3 kernel: [80590.208675] block drbd0: send bitmap
> stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Nov  6 15:25:44 node3 kernel: [80590.213106] block drbd0: receive bitmap
> stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Nov  6 15:25:44 node3 kernel: [80590.213115] block drbd0: helper
> command: /sbin/drbdadm before-resync-source minor-0
> Nov  6 15:25:44 node3 kernel: [80590.215205] block drbd0: helper
> command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
> Nov  6 15:25:44 node3 kernel: [80590.215229] block drbd0: conn(
> WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) 
> Nov  6 15:25:44 node3 kernel: [80590.215244] block drbd0: Began resync
> as SyncSource (will sync 0 KB [0 bits set]).
> Nov  6 15:25:44 node3 kernel: [80590.215306] block drbd0: updated sync
> UUID D2B7CD09671C54D5:B1BAAE133EA69D76:B1B9AE133EA69D76:9E43F7F6666358F4
> Nov  6 15:25:44 node3 kernel: [80590.239925] block drbd0: Resync done
> (total 1 sec; paused 0 sec; 0 K/sec)
> Nov  6 15:25:44 node3 kernel: [80590.239933] block drbd0: updated UUIDs
> D2B7CD09671C54D5:0000000000000000:B1BAAE133EA69D76:B1B9AE133EA69D76
> Nov  6 15:25:44 node3 kernel: [80590.239943] block drbd0: conn(
> SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
> Nov  6 15:25:44 node3 crmd[1907]:   notice: process_lrm_event:
> node3-drbd_monitor_29000:530 [ \n ]
> Nov  6 15:25:44 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=541, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=545, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:25:45 node3 kernel: [80590.674679] block drbd0: role( Primary
> -> Secondary ) 
> Nov  6 15:25:45 node3 kernel: [80590.674755] block drbd0: bitmap WRITE
> of 0 pages took 0 jiffies
> Nov  6 15:25:45 node3 kernel: [80590.674761] block drbd0: 0 KB (0 bits)
> marked out-of-sync by on disk bit-map.
> Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_demote_0 (call=548, rc=0, cib-update=131, confirmed=true) ok
> Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=551, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=554, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:25:45 node3 kernel: [80590.826422] block drbd0: peer(
> Secondary -> Primary ) 
> Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=557, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_monitor_31000 (call=560, rc=0, cib-update=132,
> confirmed=false) ok
> 
> 
> 
> Syslog on node3 WITH filesystem resource:
> 
> Nov  6 15:32:26 node3 corosync[1194]:   [QUORUM] Members[5]: 167798047
> 167798048 167798049 167798050 167798051
> Nov  6 15:32:26 node3 corosync[1194]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Nov  6 15:32:26 node3 pacemakerd[1900]:   notice: crm_update_peer_state:
> pcmk_quorum_notification: Node node4[167798051] - state is now member
> (was lost)
> Nov  6 15:32:26 node3 crmd[1907]:   notice: crm_update_peer_state:
> pcmk_quorum_notification: Node node4[167798051] - state is now member
> (was lost)
> Nov  6 15:33:07 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=614, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:33:07 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=617, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:33:08 node3 kernel: [81033.978683] drbd shared: Handshake
> successful: Agreed network protocol version 101
> Nov  6 15:33:08 node3 kernel: [81033.978688] drbd shared: Agreed to
> support TRIM on protocol level
> Nov  6 15:33:08 node3 kernel: [81033.978734] drbd shared: conn(
> WFConnection -> WFReportParams ) 
> Nov  6 15:33:08 node3 kernel: [81033.978738] drbd shared: Starting
> asender thread (from drbd_r_shared [12796])
> Nov  6 15:33:08 node3 kernel: [81033.996165] block drbd0:
> drbd_sync_handshake:
> Nov  6 15:33:08 node3 kernel: [81033.996172] block drbd0: self
> 9D5DD7BC88FC931D:D2B7CD09671C54D4:B1BAAE133EA69D76:B1B9AE133EA69D76
> bits:3 flags:0
> Nov  6 15:33:08 node3 kernel: [81033.996176] block drbd0: peer
> BB5613199934BE6E:D2B7CD09671C54D5:B1BAAE133EA69D76:B1B9AE133EA69D76
> bits:3072 flags:2
> Nov  6 15:33:08 node3 kernel: [81033.996179] block drbd0:
> uuid_compare()=100 by rule 90
> Nov  6 15:33:08 node3 kernel: [81033.996185] block drbd0: helper
> command: /sbin/drbdadm initial-split-brain minor-0
> Nov  6 15:33:08 node3 kernel: [81033.998232] block drbd0: helper
> command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
> Nov  6 15:33:08 node3 kernel: [81033.998252] block drbd0: Split-Brain
> detected but unresolved, dropping connection!
> Nov  6 15:33:08 node3 kernel: [81033.998312] block drbd0: helper
> command: /sbin/drbdadm split-brain minor-0
> Nov  6 15:33:08 node3 kernel: [81034.000415] block drbd0: helper
> command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> Nov  6 15:33:08 node3 kernel: [81034.000451] drbd shared: conn(
> WFReportParams -> Disconnecting ) 
> Nov  6 15:33:08 node3 kernel: [81034.000455] drbd shared: error
> receiving ReportState, e: -5 l: 0!
> Nov  6 15:33:08 node3 kernel: [81034.000510] drbd shared: asender terminated
> Nov  6 15:33:08 node3 kernel: [81034.000512] drbd shared: Terminating
> drbd_a_shared
> Nov  6 15:33:08 node3 kernel: [81034.032189] drbd shared: Connection closed
> Nov  6 15:33:08 node3 kernel: [81034.032287] drbd shared: conn(
> Disconnecting -> StandAlone ) 
> Nov  6 15:33:08 node3 kernel: [81034.032290] drbd shared: receiver
> terminated
> Nov  6 15:33:08 node3 kernel: [81034.032292] drbd shared: Terminating
> drbd_r_shared
> Nov  6 15:33:08 node3 crmd[1907]:   notice: process_lrm_event:
> node3-drbd_monitor_29000:607 [ \n ]
> Nov  6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: INFO: Running stop
> for /dev/drbd0 on /shared
> Nov  6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: INFO: Trying to
> unmount /shared
> Nov  6 15:33:08 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=625, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: ERROR: Couldn't
> unmount /shared; trying cleanup with TERM
> Nov  6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: INFO: Some processes
> on /shared were signalled
> Nov  6 15:33:09 node3 Filesystem(p_fs-drbd)[26831]: ERROR: Couldn't
> unmount /shared; trying cleanup with TERM
> Nov  6 15:33:09 node3 Filesystem(p_fs-drbd)[26831]: INFO: Some processes
> on /shared were signalled
> Nov  6 15:33:10 node3 Filesystem(p_fs-drbd)[26831]: INFO: unmounted
> /shared successfully
> Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
> p_fs-drbd_stop_0:26831:stderr [ umount: /shared: device is busy. ]
> Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
> p_fs-drbd_stop_0:26831:stderr [         (In some cases useful info about
> processes that use ]
> Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
> p_fs-drbd_stop_0:26831:stderr [          the device is found by lsof(8)
> or fuser(1)) ]
> Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
> p_fs-drbd_stop_0:26831:stderr [ /shared:            cc ]
> Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
> p_fs-drbd_stop_0:26831:stderr [ umount: /shared: device is busy. ]
> Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
> p_fs-drbd_stop_0:26831:stderr [         (In some cases useful info about
> processes that use ]
> Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
> p_fs-drbd_stop_0:26831:stderr [          the device is found by lsof(8)
> or fuser(1)) ]
> Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
> p_fs-drbd_stop_0:26831:stderr [ /shared:            c ]
> Nov  6 15:33:10 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation p_fs-drbd_stop_0 (call=623, rc=0, cib-update=153,
> confirmed=true) ok
> Nov  6 15:33:10 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=631, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:33:10 node3 kernel: [81036.550927] block drbd0: role( Primary
> -> Secondary ) 
> Nov  6 15:33:10 node3 kernel: [81036.551051] block drbd0: bitmap WRITE
> of 2 pages took 0 jiffies
> Nov  6 15:33:10 node3 kernel: [81036.551057] block drbd0: 12 KB (3 bits)
> marked out-of-sync by on disk bit-map.
> Nov  6 15:33:10 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_demote_0 (call=634, rc=0, cib-update=154, confirmed=true) ok
> Nov  6 15:33:11 node3 attrd[1905]:   notice: attrd_trigger_update:
> Sending flush op to all hosts for: master-drbd (1000)
> Nov  6 15:33:11 node3 attrd[1905]:   notice: attrd_perform_update: Sent
> update 235: master-drbd=1000
> Nov  6 15:33:11 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=637, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:33:11 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=640, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:33:11 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_notify_0 (call=643, rc=0, cib-update=0, confirmed=true) ok
> Nov  6 15:33:11 node3 crmd[1907]:   notice: process_lrm_event: LRM
> operation drbd_monitor_31000 (call=646, rc=0, cib-update=155,
> confirmed=false) ok
> 
> 
> Did I miss something?
> 
> Thanks!
> 
> 
> 
> Regards,
> 
> Philipp
> 
> 
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?


