[DRBD-user] Pacemaker/drbd split-brain after reboot only if filesystem resource is active

Philipp Dittert philipp.dittert at gmail.com
Fri Nov 6 16:27:11 CET 2015

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

I want to build a HA nfs storage with pacemaker and drbd, but getting so
problems which i couldn't solve by myself. So need your help.


i've got a five node cluster where 2 nodes are sharing a drbd volume
(Primary/Secondary). The cluster is configured as "opt-in" cluster, so i
need to define which resource can run on a node. A pacemaker location
contraint is used to set a specific node (node4) as my primary drbd node.
All is working as expected, e.g. if i move the drbd resource via "crm
resource migrate ms_drbd node3" it will be correctly promoted as master. If
node4 is the primary drbd node and i reboot the machine, the resouce will
be moved to node3. After node4 is back, the resource is migrated back to
it. drbd is in sync and all is good. If i add a filesystem resource which
should handle the drbd mount and reboot node4 the sync fails at some point
after it is online again and i get a split brain.


Here is my pacemaker config:

node $id="167798047" lb
node $id="167798048" node1
node $id="167798049" node2
node $id="167798050" node3 \
attributes standby="off"
node $id="167798051" node4 \
attributes standby="off"
primitive drbd ocf:linbit:drbd \
params drbd_resource="shared" \
op monitor interval="29s" role="Master" \
op monitor interval="31s" role="Slave"
primitive p_fs-drbd ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/shared" fstype="ext4" \
op monitor timeout="40" interval="20" depth="0" \
meta target-role="Started"
ms ms_drbd drbd \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
notify="true"
location cli-prefer-ms_drbd ms_drbd inf: node4
location location-ms_drbd-node3-200 ms_drbd 200: node3
location location-ms_drbd-node4-0 ms_drbd 0: node4
location location-p_fs-drbd-node3-0 p_fs-drbd 0: node3
location location-p_fs-drbd-node4-0 p_fs-drbd 0: node4
colocation fs_on_drbd inf: p_fs-drbd ms_drbd:Master
order mount_after_drbd inf: ms_drbd:promote p_fs-drbd:start
property $id="cib-bootstrap-options" \
dc-version="1.1.10-42f2063" \
cluster-infrastructure="corosync" \
symmetric-cluster="false" \
stonith-enabled="false" \
last-lrm-refresh="1446820125"



Syslog on node3 WITHOUT filesystem resource:


Nov  6 15:25:02 node3 corosync[1194]:   [QUORUM] Members[5]: 167798047
167798048 167798049 167798050 167798051
Nov  6 15:25:02 node3 corosync[1194]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Nov  6 15:25:02 node3 crmd[1907]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node node4[167798051] - state is now member (was
lost)
Nov  6 15:25:02 node3 pacemakerd[1900]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node node4[167798051] - state is now member (was
lost)
Nov  6 15:25:44 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=533, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:25:44 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=536, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:25:44 node3 kernel: [80590.188119] drbd shared: Handshake
successful: Agreed network protocol version 101
Nov  6 15:25:44 node3 kernel: [80590.188124] drbd shared: Agreed to support
TRIM on protocol level
Nov  6 15:25:44 node3 kernel: [80590.188161] drbd shared: conn(
WFConnection -> WFReportParams )
Nov  6 15:25:44 node3 kernel: [80590.188164] drbd shared: Starting asender
thread (from drbd_r_shared [12796])
Nov  6 15:25:44 node3 kernel: [80590.208166] block drbd0:
drbd_sync_handshake:
Nov  6 15:25:44 node3 kernel: [80590.208173] block drbd0: self
D2B7CD09671C54D5:B1B9AE133EA69D76:9E43F7F6666358F4:9E42F7F6666358F4 bits:0
flags:0
Nov  6 15:25:44 node3 kernel: [80590.208178] block drbd0: peer
B1B9AE133EA69D76:0000000000000000:9E43F7F6666358F4:9E42F7F6666358F4 bits:0
flags:2
Nov  6 15:25:44 node3 kernel: [80590.208181] block drbd0: uuid_compare()=1
by rule 70
Nov  6 15:25:44 node3 kernel: [80590.208191] block drbd0: peer( Unknown ->
Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown ->
Consistent )
Nov  6 15:25:44 node3 kernel: [80590.208675] block drbd0: send bitmap stats
[Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Nov  6 15:25:44 node3 kernel: [80590.213106] block drbd0: receive bitmap
stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Nov  6 15:25:44 node3 kernel: [80590.213115] block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0
Nov  6 15:25:44 node3 kernel: [80590.215205] block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Nov  6 15:25:44 node3 kernel: [80590.215229] block drbd0: conn( WFBitMapS
-> SyncSource ) pdsk( Consistent -> Inconsistent )
Nov  6 15:25:44 node3 kernel: [80590.215244] block drbd0: Began resync as
SyncSource (will sync 0 KB [0 bits set]).
Nov  6 15:25:44 node3 kernel: [80590.215306] block drbd0: updated sync UUID
D2B7CD09671C54D5:B1BAAE133EA69D76:B1B9AE133EA69D76:9E43F7F6666358F4
Nov  6 15:25:44 node3 kernel: [80590.239925] block drbd0: Resync done
(total 1 sec; paused 0 sec; 0 K/sec)
Nov  6 15:25:44 node3 kernel: [80590.239933] block drbd0: updated UUIDs
D2B7CD09671C54D5:0000000000000000:B1BAAE133EA69D76:B1B9AE133EA69D76
Nov  6 15:25:44 node3 kernel: [80590.239943] block drbd0: conn( SyncSource
-> Connected ) pdsk( Inconsistent -> UpToDate )
Nov  6 15:25:44 node3 crmd[1907]:   notice: process_lrm_event:
node3-drbd_monitor_29000:530 [ \n ]
Nov  6 15:25:44 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=541, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=545, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:25:45 node3 kernel: [80590.674679] block drbd0: role( Primary ->
Secondary )
Nov  6 15:25:45 node3 kernel: [80590.674755] block drbd0: bitmap WRITE of 0
pages took 0 jiffies
Nov  6 15:25:45 node3 kernel: [80590.674761] block drbd0: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_demote_0 (call=548, rc=0, cib-update=131, confirmed=true) ok
Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=551, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=554, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:25:45 node3 kernel: [80590.826422] block drbd0: peer( Secondary
-> Primary )
Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=557, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:25:45 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_monitor_31000 (call=560, rc=0, cib-update=132,
confirmed=false) ok



Syslog on node3 WITH filesystem resource:

Nov  6 15:32:26 node3 corosync[1194]:   [QUORUM] Members[5]: 167798047
167798048 167798049 167798050 167798051
Nov  6 15:32:26 node3 corosync[1194]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Nov  6 15:32:26 node3 pacemakerd[1900]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node node4[167798051] - state is now member (was
lost)
Nov  6 15:32:26 node3 crmd[1907]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node node4[167798051] - state is now member (was
lost)
Nov  6 15:33:07 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=614, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:33:07 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=617, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:33:08 node3 kernel: [81033.978683] drbd shared: Handshake
successful: Agreed network protocol version 101
Nov  6 15:33:08 node3 kernel: [81033.978688] drbd shared: Agreed to support
TRIM on protocol level
Nov  6 15:33:08 node3 kernel: [81033.978734] drbd shared: conn(
WFConnection -> WFReportParams )
Nov  6 15:33:08 node3 kernel: [81033.978738] drbd shared: Starting asender
thread (from drbd_r_shared [12796])
Nov  6 15:33:08 node3 kernel: [81033.996165] block drbd0:
drbd_sync_handshake:
Nov  6 15:33:08 node3 kernel: [81033.996172] block drbd0: self
9D5DD7BC88FC931D:D2B7CD09671C54D4:B1BAAE133EA69D76:B1B9AE133EA69D76 bits:3
flags:0
Nov  6 15:33:08 node3 kernel: [81033.996176] block drbd0: peer
BB5613199934BE6E:D2B7CD09671C54D5:B1BAAE133EA69D76:B1B9AE133EA69D76
bits:3072 flags:2
Nov  6 15:33:08 node3 kernel: [81033.996179] block drbd0:
uuid_compare()=100 by rule 90
Nov  6 15:33:08 node3 kernel: [81033.996185] block drbd0: helper command:
/sbin/drbdadm initial-split-brain minor-0
Nov  6 15:33:08 node3 kernel: [81033.998232] block drbd0: helper command:
/sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
Nov  6 15:33:08 node3 kernel: [81033.998252] block drbd0: Split-Brain
detected but unresolved, dropping connection!
Nov  6 15:33:08 node3 kernel: [81033.998312] block drbd0: helper command:
/sbin/drbdadm split-brain minor-0
Nov  6 15:33:08 node3 kernel: [81034.000415] block drbd0: helper command:
/sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Nov  6 15:33:08 node3 kernel: [81034.000451] drbd shared: conn(
WFReportParams -> Disconnecting )
Nov  6 15:33:08 node3 kernel: [81034.000455] drbd shared: error receiving
ReportState, e: -5 l: 0!
Nov  6 15:33:08 node3 kernel: [81034.000510] drbd shared: asender terminated
Nov  6 15:33:08 node3 kernel: [81034.000512] drbd shared: Terminating
drbd_a_shared
Nov  6 15:33:08 node3 kernel: [81034.032189] drbd shared: Connection closed
Nov  6 15:33:08 node3 kernel: [81034.032287] drbd shared: conn(
Disconnecting -> StandAlone )
Nov  6 15:33:08 node3 kernel: [81034.032290] drbd shared: receiver
terminated
Nov  6 15:33:08 node3 kernel: [81034.032292] drbd shared: Terminating
drbd_r_shared
Nov  6 15:33:08 node3 crmd[1907]:   notice: process_lrm_event:
node3-drbd_monitor_29000:607 [ \n ]
Nov  6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: INFO: Running stop for
/dev/drbd0 on /shared
Nov  6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: INFO: Trying to unmount
/shared
Nov  6 15:33:08 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=625, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: ERROR: Couldn't unmount
/shared; trying cleanup with TERM
Nov  6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: INFO: Some processes on
/shared were signalled
Nov  6 15:33:09 node3 Filesystem(p_fs-drbd)[26831]: ERROR: Couldn't unmount
/shared; trying cleanup with TERM
Nov  6 15:33:09 node3 Filesystem(p_fs-drbd)[26831]: INFO: Some processes on
/shared were signalled
Nov  6 15:33:10 node3 Filesystem(p_fs-drbd)[26831]: INFO: unmounted /shared
successfully
Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
p_fs-drbd_stop_0:26831:stderr [ umount: /shared: device is busy. ]
Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
p_fs-drbd_stop_0:26831:stderr [         (In some cases useful info about
processes that use ]
Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
p_fs-drbd_stop_0:26831:stderr [          the device is found by lsof(8) or
fuser(1)) ]
Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
p_fs-drbd_stop_0:26831:stderr [ /shared:            cc ]
Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
p_fs-drbd_stop_0:26831:stderr [ umount: /shared: device is busy. ]
Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
p_fs-drbd_stop_0:26831:stderr [         (In some cases useful info about
processes that use ]
Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
p_fs-drbd_stop_0:26831:stderr [          the device is found by lsof(8) or
fuser(1)) ]
Nov  6 15:33:10 node3 lrmd[1904]:   notice: operation_finished:
p_fs-drbd_stop_0:26831:stderr [ /shared:            c ]
Nov  6 15:33:10 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation p_fs-drbd_stop_0 (call=623, rc=0, cib-update=153, confirmed=true)
ok
Nov  6 15:33:10 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=631, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:33:10 node3 kernel: [81036.550927] block drbd0: role( Primary ->
Secondary )
Nov  6 15:33:10 node3 kernel: [81036.551051] block drbd0: bitmap WRITE of 2
pages took 0 jiffies
Nov  6 15:33:10 node3 kernel: [81036.551057] block drbd0: 12 KB (3 bits)
marked out-of-sync by on disk bit-map.
Nov  6 15:33:10 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_demote_0 (call=634, rc=0, cib-update=154, confirmed=true) ok
Nov  6 15:33:11 node3 attrd[1905]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: master-drbd (1000)
Nov  6 15:33:11 node3 attrd[1905]:   notice: attrd_perform_update: Sent
update 235: master-drbd=1000
Nov  6 15:33:11 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=637, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:33:11 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=640, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:33:11 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_notify_0 (call=643, rc=0, cib-update=0, confirmed=true) ok
Nov  6 15:33:11 node3 crmd[1907]:   notice: process_lrm_event: LRM
operation drbd_monitor_31000 (call=646, rc=0, cib-update=155,
confirmed=false) ok


Did i miss something?

Thanks!



Regards,

Philipp
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20151106/363bf7f2/attachment.htm>


More information about the drbd-user mailing list