Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, I want to build a HA nfs storage with pacemaker and drbd, but getting so problems which i couldn't solve by myself. So need your help. i've got a five node cluster where 2 nodes are sharing a drbd volume (Primary/Secondary). The cluster is configured as "opt-in" cluster, so i need to define which resource can run on a node. A pacemaker location contraint is used to set a specific node (node4) as my primary drbd node. All is working as expected, e.g. if i move the drbd resource via "crm resource migrate ms_drbd node3" it will be correctly promoted as master. If node4 is the primary drbd node and i reboot the machine, the resouce will be moved to node3. After node4 is back, the resource is migrated back to it. drbd is in sync and all is good. If i add a filesystem resource which should handle the drbd mount and reboot node4 the sync fails at some point after it is online again and i get a split brain. Here is my pacemaker config: node $id="167798047" lb node $id="167798048" node1 node $id="167798049" node2 node $id="167798050" node3 \ attributes standby="off" node $id="167798051" node4 \ attributes standby="off" primitive drbd ocf:linbit:drbd \ params drbd_resource="shared" \ op monitor interval="29s" role="Master" \ op monitor interval="31s" role="Slave" primitive p_fs-drbd ocf:heartbeat:Filesystem \ params device="/dev/drbd0" directory="/shared" fstype="ext4" \ op monitor timeout="40" interval="20" depth="0" \ meta target-role="Started" ms ms_drbd drbd \ meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" location cli-prefer-ms_drbd ms_drbd inf: node4 location location-ms_drbd-node3-200 ms_drbd 200: node3 location location-ms_drbd-node4-0 ms_drbd 0: node4 location location-p_fs-drbd-node3-0 p_fs-drbd 0: node3 location location-p_fs-drbd-node4-0 p_fs-drbd 0: node4 colocation fs_on_drbd inf: p_fs-drbd ms_drbd:Master order mount_after_drbd inf: ms_drbd:promote p_fs-drbd:start property $id="cib-bootstrap-options" \ dc-version="1.1.10-42f2063" \ cluster-infrastructure="corosync" \ symmetric-cluster="false" \ stonith-enabled="false" \ last-lrm-refresh="1446820125" Syslog on node3 WITHOUT filesystem resource: Nov 6 15:25:02 node3 corosync[1194]: [QUORUM] Members[5]: 167798047 167798048 167798049 167798050 167798051 Nov 6 15:25:02 node3 corosync[1194]: [MAIN ] Completed service synchronization, ready to provide service. Nov 6 15:25:02 node3 crmd[1907]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node4[167798051] - state is now member (was lost) Nov 6 15:25:02 node3 pacemakerd[1900]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node4[167798051] - state is now member (was lost) Nov 6 15:25:44 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=533, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:25:44 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=536, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:25:44 node3 kernel: [80590.188119] drbd shared: Handshake successful: Agreed network protocol version 101 Nov 6 15:25:44 node3 kernel: [80590.188124] drbd shared: Agreed to support TRIM on protocol level Nov 6 15:25:44 node3 kernel: [80590.188161] drbd shared: conn( WFConnection -> WFReportParams ) Nov 6 15:25:44 node3 kernel: [80590.188164] drbd shared: Starting asender thread (from drbd_r_shared [12796]) Nov 6 15:25:44 node3 kernel: [80590.208166] block drbd0: drbd_sync_handshake: Nov 6 15:25:44 node3 kernel: [80590.208173] block drbd0: self D2B7CD09671C54D5:B1B9AE133EA69D76:9E43F7F6666358F4:9E42F7F6666358F4 bits:0 flags:0 Nov 6 15:25:44 node3 kernel: [80590.208178] block drbd0: peer B1B9AE133EA69D76:0000000000000000:9E43F7F6666358F4:9E42F7F6666358F4 bits:0 flags:2 Nov 6 15:25:44 node3 kernel: [80590.208181] block drbd0: uuid_compare()=1 by rule 70 Nov 6 15:25:44 node3 kernel: [80590.208191] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent ) Nov 6 15:25:44 node3 kernel: [80590.208675] block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% Nov 6 15:25:44 node3 kernel: [80590.213106] block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% Nov 6 15:25:44 node3 kernel: [80590.213115] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 Nov 6 15:25:44 node3 kernel: [80590.215205] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) Nov 6 15:25:44 node3 kernel: [80590.215229] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) Nov 6 15:25:44 node3 kernel: [80590.215244] block drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]). Nov 6 15:25:44 node3 kernel: [80590.215306] block drbd0: updated sync UUID D2B7CD09671C54D5:B1BAAE133EA69D76:B1B9AE133EA69D76:9E43F7F6666358F4 Nov 6 15:25:44 node3 kernel: [80590.239925] block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec) Nov 6 15:25:44 node3 kernel: [80590.239933] block drbd0: updated UUIDs D2B7CD09671C54D5:0000000000000000:B1BAAE133EA69D76:B1B9AE133EA69D76 Nov 6 15:25:44 node3 kernel: [80590.239943] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) Nov 6 15:25:44 node3 crmd[1907]: notice: process_lrm_event: node3-drbd_monitor_29000:530 [ \n ] Nov 6 15:25:44 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=541, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:25:45 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=545, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:25:45 node3 kernel: [80590.674679] block drbd0: role( Primary -> Secondary ) Nov 6 15:25:45 node3 kernel: [80590.674755] block drbd0: bitmap WRITE of 0 pages took 0 jiffies Nov 6 15:25:45 node3 kernel: [80590.674761] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Nov 6 15:25:45 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_demote_0 (call=548, rc=0, cib-update=131, confirmed=true) ok Nov 6 15:25:45 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=551, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:25:45 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=554, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:25:45 node3 kernel: [80590.826422] block drbd0: peer( Secondary -> Primary ) Nov 6 15:25:45 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=557, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:25:45 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_monitor_31000 (call=560, rc=0, cib-update=132, confirmed=false) ok Syslog on node3 WITH filesystem resource: Nov 6 15:32:26 node3 corosync[1194]: [QUORUM] Members[5]: 167798047 167798048 167798049 167798050 167798051 Nov 6 15:32:26 node3 corosync[1194]: [MAIN ] Completed service synchronization, ready to provide service. Nov 6 15:32:26 node3 pacemakerd[1900]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node4[167798051] - state is now member (was lost) Nov 6 15:32:26 node3 crmd[1907]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node4[167798051] - state is now member (was lost) Nov 6 15:33:07 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=614, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:33:07 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=617, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:33:08 node3 kernel: [81033.978683] drbd shared: Handshake successful: Agreed network protocol version 101 Nov 6 15:33:08 node3 kernel: [81033.978688] drbd shared: Agreed to support TRIM on protocol level Nov 6 15:33:08 node3 kernel: [81033.978734] drbd shared: conn( WFConnection -> WFReportParams ) Nov 6 15:33:08 node3 kernel: [81033.978738] drbd shared: Starting asender thread (from drbd_r_shared [12796]) Nov 6 15:33:08 node3 kernel: [81033.996165] block drbd0: drbd_sync_handshake: Nov 6 15:33:08 node3 kernel: [81033.996172] block drbd0: self 9D5DD7BC88FC931D:D2B7CD09671C54D4:B1BAAE133EA69D76:B1B9AE133EA69D76 bits:3 flags:0 Nov 6 15:33:08 node3 kernel: [81033.996176] block drbd0: peer BB5613199934BE6E:D2B7CD09671C54D5:B1BAAE133EA69D76:B1B9AE133EA69D76 bits:3072 flags:2 Nov 6 15:33:08 node3 kernel: [81033.996179] block drbd0: uuid_compare()=100 by rule 90 Nov 6 15:33:08 node3 kernel: [81033.996185] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 Nov 6 15:33:08 node3 kernel: [81033.998232] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0) Nov 6 15:33:08 node3 kernel: [81033.998252] block drbd0: Split-Brain detected but unresolved, dropping connection! Nov 6 15:33:08 node3 kernel: [81033.998312] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 Nov 6 15:33:08 node3 kernel: [81034.000415] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0) Nov 6 15:33:08 node3 kernel: [81034.000451] drbd shared: conn( WFReportParams -> Disconnecting ) Nov 6 15:33:08 node3 kernel: [81034.000455] drbd shared: error receiving ReportState, e: -5 l: 0! Nov 6 15:33:08 node3 kernel: [81034.000510] drbd shared: asender terminated Nov 6 15:33:08 node3 kernel: [81034.000512] drbd shared: Terminating drbd_a_shared Nov 6 15:33:08 node3 kernel: [81034.032189] drbd shared: Connection closed Nov 6 15:33:08 node3 kernel: [81034.032287] drbd shared: conn( Disconnecting -> StandAlone ) Nov 6 15:33:08 node3 kernel: [81034.032290] drbd shared: receiver terminated Nov 6 15:33:08 node3 kernel: [81034.032292] drbd shared: Terminating drbd_r_shared Nov 6 15:33:08 node3 crmd[1907]: notice: process_lrm_event: node3-drbd_monitor_29000:607 [ \n ] Nov 6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: INFO: Running stop for /dev/drbd0 on /shared Nov 6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: INFO: Trying to unmount /shared Nov 6 15:33:08 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=625, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: ERROR: Couldn't unmount /shared; trying cleanup with TERM Nov 6 15:33:08 node3 Filesystem(p_fs-drbd)[26831]: INFO: Some processes on /shared were signalled Nov 6 15:33:09 node3 Filesystem(p_fs-drbd)[26831]: ERROR: Couldn't unmount /shared; trying cleanup with TERM Nov 6 15:33:09 node3 Filesystem(p_fs-drbd)[26831]: INFO: Some processes on /shared were signalled Nov 6 15:33:10 node3 Filesystem(p_fs-drbd)[26831]: INFO: unmounted /shared successfully Nov 6 15:33:10 node3 lrmd[1904]: notice: operation_finished: p_fs-drbd_stop_0:26831:stderr [ umount: /shared: device is busy. ] Nov 6 15:33:10 node3 lrmd[1904]: notice: operation_finished: p_fs-drbd_stop_0:26831:stderr [ (In some cases useful info about processes that use ] Nov 6 15:33:10 node3 lrmd[1904]: notice: operation_finished: p_fs-drbd_stop_0:26831:stderr [ the device is found by lsof(8) or fuser(1)) ] Nov 6 15:33:10 node3 lrmd[1904]: notice: operation_finished: p_fs-drbd_stop_0:26831:stderr [ /shared: cc ] Nov 6 15:33:10 node3 lrmd[1904]: notice: operation_finished: p_fs-drbd_stop_0:26831:stderr [ umount: /shared: device is busy. ] Nov 6 15:33:10 node3 lrmd[1904]: notice: operation_finished: p_fs-drbd_stop_0:26831:stderr [ (In some cases useful info about processes that use ] Nov 6 15:33:10 node3 lrmd[1904]: notice: operation_finished: p_fs-drbd_stop_0:26831:stderr [ the device is found by lsof(8) or fuser(1)) ] Nov 6 15:33:10 node3 lrmd[1904]: notice: operation_finished: p_fs-drbd_stop_0:26831:stderr [ /shared: c ] Nov 6 15:33:10 node3 crmd[1907]: notice: process_lrm_event: LRM operation p_fs-drbd_stop_0 (call=623, rc=0, cib-update=153, confirmed=true) ok Nov 6 15:33:10 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=631, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:33:10 node3 kernel: [81036.550927] block drbd0: role( Primary -> Secondary ) Nov 6 15:33:10 node3 kernel: [81036.551051] block drbd0: bitmap WRITE of 2 pages took 0 jiffies Nov 6 15:33:10 node3 kernel: [81036.551057] block drbd0: 12 KB (3 bits) marked out-of-sync by on disk bit-map. Nov 6 15:33:10 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_demote_0 (call=634, rc=0, cib-update=154, confirmed=true) ok Nov 6 15:33:11 node3 attrd[1905]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd (1000) Nov 6 15:33:11 node3 attrd[1905]: notice: attrd_perform_update: Sent update 235: master-drbd=1000 Nov 6 15:33:11 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=637, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:33:11 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=640, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:33:11 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_notify_0 (call=643, rc=0, cib-update=0, confirmed=true) ok Nov 6 15:33:11 node3 crmd[1907]: notice: process_lrm_event: LRM operation drbd_monitor_31000 (call=646, rc=0, cib-update=155, confirmed=false) ok Did i miss something? Thanks! Regards, Philipp -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20151106/363bf7f2/attachment.htm>