Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

I have been struggling with a problem for two days and have not found a solution yet. I think it might be something trivially simple that I am overlooking.

I have two fresh Ubuntu 14.04.3 systems installed in Qemu. (I can provide the disk images on request, if anyone needs them to look into the problem.)

The following software is installed:

drbd8-utils 2:8.4.4-1ubuntu1
pacemaker 1.1.10+git20130802-1ubuntu2.3
corosync 2.3.3-1ubuntu1

I am using the LTS trusty kernel 3.13.0-68-generic. The drbd init script is disabled (update-rc.d -f drbd remove). I have the attached corosync.conf on both nodes.

My DRBD resource r0 looks like this:

resource r0 {
        device /dev/drbd0 minor 0;
        disk /dev/sdb1;
        meta-disk internal;
        on drbd01 {
                address 10.20.42.71:7780;
        }
        on drbd02 {
                address 10.20.42.72:7780;
        }
}

I haven't changed anything in /etc/drbd.d/global_common.conf.
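That means in particular that none of the automatic split-brain recovery policies from the DRBD manual are configured. Just to be explicit about what is missing, I do not have anything like the following in my net section (the values here are only the manual's examples, not something I have set):

net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
}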
My CRM configuration is simple and nearly the same as the example in the DRBD manual, just without MySQL:

node $id="169093703" drbd01
node $id="169093704" drbd02
primitive p_drbd ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
primitive p_filesystem ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/drbd" fstype="ext4"
primitive p_sharedip ocf:heartbeat:IPaddr2 \
        params ip="10.20.42.70" nic="eth0"
group grp_drbd p_filesystem p_sharedip
ms ms_drbd p_drbd \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation ip_on_drbd inf: grp_drbd ms_drbd:Master
order ip_after_drbd inf: ms_drbd:promote grp_drbd:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-42f2063" \
        cluster-infrastructure="corosync" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"

Everything looks good to me in crm_mon:

Last updated: Fri Nov 13 17:00:40 2015
Last change: Fri Nov 13 16:37:39 2015 via cibadmin on drbd01
Stack: corosync
Current DC: drbd01 (169093703) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
4 Resources configured

Online: [ drbd01 drbd02 ]

 Master/Slave Set: ms_drbd [p_drbd]
     Masters: [ drbd01 ]
     Slaves: [ drbd02 ]
 Resource Group: grp_drbd
     p_filesystem       (ocf::heartbeat:Filesystem):    Started drbd01
     p_sharedip         (ocf::heartbeat:IPaddr2):       Started drbd01

DRBD is fine, too:

root@drbd01:~# cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: 6551AD2C98F533733BE558C
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:4096 nr:0 dw:4 dr:4841 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

I then reboot drbd01 and the failover works great:

Last updated: Fri Nov 13 17:02:32 2015
Last change: Fri Nov 13 16:37:39 2015 via cibadmin on drbd01
Stack: corosync
Current DC: drbd02 (169093704) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
4 Resources configured

Online: [ drbd01 drbd02 ]

 Master/Slave Set: ms_drbd [p_drbd]
     Masters: [ drbd02 ]
     Slaves: [ drbd01 ]
 Resource Group: grp_drbd
     p_filesystem       (ocf::heartbeat:Filesystem):    Started drbd02
     p_sharedip         (ocf::heartbeat:IPaddr2):       Started drbd02

Everything looks fine from the CRM perspective. But when I log back into drbd01 I see an unresolved split-brain:

cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: 6551AD2C98F533733BE558C
 0: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4096

With the following dmesg output:

[ 7.430374] drbd: initialized. Version: 8.4.3 (api:1/proto:86-101)
[ 7.430376] drbd: srcversion: 6551AD2C98F533733BE558C
[ 7.430377] drbd: registered as block device major 147
[ 7.468725] d-con r0: Starting worker thread (from drbdsetup [970])
[ 7.469322] block drbd0: disk( Diskless -> Attaching )
[ 7.469426] d-con r0: Method to ensure write ordering: flush
[ 7.469428] block drbd0: max BIO size = 1048576
[ 7.469432] block drbd0: drbd_bm_resize called with capacity == 4192056
[ 7.469440] block drbd0: resync bitmap: bits=524007 words=8188 pages=16
[ 7.469442] block drbd0: size = 2047 MB (2096028 KB)
[ 7.469976] block drbd0: bitmap READ of 16 pages took 0 jiffies
[ 7.469986] block drbd0: recounting of set bits took additional 0 jiffies
[ 7.469987] block drbd0: 4096 KB (1024 bits) marked out-of-sync by on disk bit-map.
[ 7.470001] block drbd0: disk( Attaching -> UpToDate )
[ 7.470003] block drbd0: attached to UUIDs 44F1F08DBF5F3F59:4EAEF009CE66D739:AF01AF11C6E607E8:AF00AF11C6E607E8
[ 7.477742] d-con r0: conn( StandAlone -> Unconnected )
[ 7.477753] d-con r0: Starting receiver thread (from drbd_w_r0 [971])
[ 7.478619] d-con r0: receiver (re)started
[ 7.478627] d-con r0: conn( Unconnected -> WFConnection )
[ 7.979066] d-con r0: Handshake successful: Agreed network protocol version 101
[ 7.979150] d-con r0: conn( WFConnection -> WFReportParams )
[ 7.979152] d-con r0: Starting asender thread (from drbd_r_r0 [980])
[ 7.979342] block drbd0: drbd_sync_handshake:
[ 7.979345] block drbd0: self 44F1F08DBF5F3F58:4EAEF009CE66D739:AF01AF11C6E607E8:AF00AF11C6E607E8 bits:1024 flags:0
[ 7.979347] block drbd0: peer 263D532088F42DC9:4EAEF009CE66D738:AF01AF11C6E607E8:AF00AF11C6E607E8 bits:1 flags:0
[ 7.979349] block drbd0: uuid_compare()=100 by rule 90
[ 7.979351] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
[ 7.980176] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
[ 7.980186] block drbd0: Split-Brain detected but unresolved, dropping connection!
[ 7.980502] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
[ 7.981054] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
[ 7.981070] d-con r0: conn( WFReportParams -> Disconnecting )
[ 7.981072] d-con r0: error receiving ReportState, e: -5 l: 0!
[ 7.981272] d-con r0: asender terminated
[ 7.981273] d-con r0: Terminating drbd_a_r0
[ 7.981410] d-con r0: Connection closed
[ 7.981416] d-con r0: conn( Disconnecting -> StandAlone )
[ 7.981417] d-con r0: receiver terminated
[ 7.981418] d-con r0: Terminating drbd_r_r0

Is this the expected behavior when no fencing or STONITH is enabled in my two-node cluster?

I have seen this posting, but its suggestions didn't solve my problem:
http://serverfault.com/questions/663106/split-brain-on-drbd-and-pacemaker-cluster

best regards
Waldemar
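PS: If it matters, my reading of the DRBD 8.4 manual is that the resource-level fencing I have not enabled would look roughly like this (an untested sketch on my side; the handler paths are what I believe drbd8-utils ships):

resource r0 {
        disk {
                fencing resource-only;
        }
        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        ...
}

I have not tried this yet, because I first wanted to understand whether the split-brain is expected without it.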