Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 13/11/15 11:12 AM, Waldemar Brodkorb wrote:
> Hi,
>
> I have been struggling with a problem for two days and have found no
> solution yet. I think it might be something trivially simple I am
> overlooking.
>
> I have two fresh Ubuntu 14.04.3 systems installed in Qemu. (I can
> provide the disk images on request if anyone needs them to see the
> problem.)
>
> The following software is installed:
>
> drbd8-utils 2:8.4.4-1ubuntu1
> pacemaker 1.1.10+git20130802-1ubuntu2.3
> corosync 2.3.3-1ubuntu1
>
> I am using the LTS trusty kernel 3.13.0-68-generic.
> The drbd init script is disabled (update-rc.d -f drbd remove).
>
> I have the attached corosync.conf on both nodes.
> My DRBD resource r0 looks like:
>
> resource r0 {
>     device /dev/drbd0 minor 0;
>     disk /dev/sdb1;
>     meta-disk internal;
>     on drbd01 {
>         address 10.20.42.71:7780;
>     }
>     on drbd02 {
>         address 10.20.42.72:7780;
>     }
> }
>
> I haven't changed anything in /etc/drbd.d/global_common.conf.
>
> My CRM configuration is simple and nearly the same as the example in
> the DRBD manual, minus MySQL:
>
> node $id="169093703" drbd01
> node $id="169093704" drbd02
> primitive p_drbd ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op monitor interval="29s" role="Master" \
>     op monitor interval="31s" role="Slave"
> primitive p_filesystem ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/drbd" fstype="ext4"
> primitive p_sharedip ocf:heartbeat:IPaddr2 \
>     params ip="10.20.42.70" nic="eth0"
> group grp_drbd p_filesystem p_sharedip
> ms ms_drbd p_drbd \
>     meta master-max="1" master-node-max="1" clone-max="2" \
>     clone-node-max="1" notify="true"
> colocation ip_on_drbd inf: grp_drbd ms_drbd:Master
> order ip_after_drbd inf: ms_drbd:promote grp_drbd:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \

And here's the core of the problem. Configure and test stonith in
pacemaker. Then configure DRBD to use 'fencing resource-and-stonith;'
and configure 'crm-{un,}fence-peer.sh' as the {un,}fence handlers. (A
sketch of both pieces follows at the end of this message.)

>     no-quorum-policy="ignore"
>
> All looks good to me when I look at crm_mon:
>
> Last updated: Fri Nov 13 17:00:40 2015
> Last change: Fri Nov 13 16:37:39 2015 via cibadmin on drbd01
> Stack: corosync
> Current DC: drbd01 (169093703) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 4 Resources configured
>
> Online: [ drbd01 drbd02 ]
>
> Master/Slave Set: ms_drbd [p_drbd]
>     Masters: [ drbd01 ]
>     Slaves: [ drbd02 ]
> Resource Group: grp_drbd
>     p_filesystem (ocf::heartbeat:Filesystem): Started drbd01
>     p_sharedip (ocf::heartbeat:IPaddr2): Started drbd01
>
> The DRBD is fine, too:
>
> root@drbd01:~# cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> srcversion: 6551AD2C98F533733BE558C
>  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>     ns:4096 nr:0 dw:4 dr:4841 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> I then reboot drbd01 and the failover works great:
>
> Last updated: Fri Nov 13 17:02:32 2015
> Last change: Fri Nov 13 16:37:39 2015 via cibadmin on drbd01
> Stack: corosync
> Current DC: drbd02 (169093704) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 4 Resources configured
>
> Online: [ drbd01 drbd02 ]
>
> Master/Slave Set: ms_drbd [p_drbd]
>     Masters: [ drbd02 ]
>     Slaves: [ drbd01 ]
> Resource Group: grp_drbd
>     p_filesystem (ocf::heartbeat:Filesystem): Started drbd02
>     p_sharedip (ocf::heartbeat:IPaddr2): Started drbd02
>
> Everything looks nice from the CRM perspective.
> But when I log back into drbd01, I see an unresolved split-brain:
>
> cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> srcversion: 6551AD2C98F533733BE558C
>  0: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown r-----
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4096
>
> With the following dmesg output:
>
> [    7.430374] drbd: initialized. Version: 8.4.3 (api:1/proto:86-101)
> [    7.430376] drbd: srcversion: 6551AD2C98F533733BE558C
> [    7.430377] drbd: registered as block device major 147
> [    7.468725] d-con r0: Starting worker thread (from drbdsetup [970])
> [    7.469322] block drbd0: disk( Diskless -> Attaching )
> [    7.469426] d-con r0: Method to ensure write ordering: flush
> [    7.469428] block drbd0: max BIO size = 1048576
> [    7.469432] block drbd0: drbd_bm_resize called with capacity == 4192056
> [    7.469440] block drbd0: resync bitmap: bits=524007 words=8188 pages=16
> [    7.469442] block drbd0: size = 2047 MB (2096028 KB)
> [    7.469976] block drbd0: bitmap READ of 16 pages took 0 jiffies
> [    7.469986] block drbd0: recounting of set bits took additional 0 jiffies
> [    7.469987] block drbd0: 4096 KB (1024 bits) marked out-of-sync by on disk bit-map.
> [    7.470001] block drbd0: disk( Attaching -> UpToDate )
> [    7.470003] block drbd0: attached to UUIDs 44F1F08DBF5F3F59:4EAEF009CE66D739:AF01AF11C6E607E8:AF00AF11C6E607E8
> [    7.477742] d-con r0: conn( StandAlone -> Unconnected )
> [    7.477753] d-con r0: Starting receiver thread (from drbd_w_r0 [971])
> [    7.478619] d-con r0: receiver (re)started
> [    7.478627] d-con r0: conn( Unconnected -> WFConnection )
> [    7.979066] d-con r0: Handshake successful: Agreed network protocol version 101
> [    7.979150] d-con r0: conn( WFConnection -> WFReportParams )
> [    7.979152] d-con r0: Starting asender thread (from drbd_r_r0 [980])
> [    7.979342] block drbd0: drbd_sync_handshake:
> [    7.979345] block drbd0: self 44F1F08DBF5F3F58:4EAEF009CE66D739:AF01AF11C6E607E8:AF00AF11C6E607E8 bits:1024 flags:0
> [    7.979347] block drbd0: peer 263D532088F42DC9:4EAEF009CE66D738:AF01AF11C6E607E8:AF00AF11C6E607E8 bits:1 flags:0
> [    7.979349] block drbd0: uuid_compare()=100 by rule 90
> [    7.979351] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
> [    7.980176] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
> [    7.980186] block drbd0: Split-Brain detected but unresolved, dropping connection!
> [    7.980502] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
> [    7.981054] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> [    7.981070] d-con r0: conn( WFReportParams -> Disconnecting )
> [    7.981072] d-con r0: error receiving ReportState, e: -5 l: 0!
> [    7.981272] d-con r0: asender terminated
> [    7.981273] d-con r0: Terminating drbd_a_r0
> [    7.981410] d-con r0: Connection closed
> [    7.981416] d-con r0: conn( Disconnecting -> StandAlone )
> [    7.981417] d-con r0: receiver terminated
> [    7.981418] d-con r0: Terminating drbd_r_r0
>
> Is this the expected behavior when no fencing or stonith is enabled
> in my two-node cluster?
>
> I have seen this posting, but its advice didn't solve my problem:
>
> http://serverfault.com/questions/663106/split-brain-on-drbd-and-pacemaker-cluster
>
> best regards
>  Waldemar
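To flesh out the first step: a minimal sketch of a stonith
configuration in crm shell syntax for this particular setup. Since
both nodes are Qemu guests, it assumes the fence_virsh agent, a libvirt
hypervisor reachable at 192.168.122.1, and guest names matching the
node names; the agent choice, address, and credentials are placeholders
you will need to adapt.

primitive st_drbd01 stonith:fence_virsh \
    params ipaddr="192.168.122.1" login="root" passwd="secret" \
        port="drbd01" pcmk_host_list="drbd01" \
    op monitor interval="60s"
primitive st_drbd02 stonith:fence_virsh \
    params ipaddr="192.168.122.1" login="root" passwd="secret" \
        port="drbd02" pcmk_host_list="drbd02" \
    op monitor interval="60s"
location l_st_drbd01 st_drbd01 -inf: drbd01
location l_st_drbd02 st_drbd02 -inf: drbd02
property stonith-enabled="true"

The location constraints keep each fence device off the node it is
meant to kill. Test fencing from each node (for example with
'stonith_admin --reboot drbd02') before relying on it.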
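On the DRBD side, the same advice as a config sketch, assuming the
handler scripts live in /usr/lib/drbd/ as shipped by drbd8-utils on
Ubuntu (verify the paths on your install):

resource r0 {
    disk {
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # existing device/disk/meta-disk/on sections stay as they are
}

With 'resource-and-stonith', DRBD suspends I/O and calls the fence-peer
handler when it loses the peer; crm-fence-peer.sh then puts a
constraint into the CIB that stops Pacemaker from promoting the stale
side, and crm-unfence-peer.sh removes it again once resync finishes.
Run 'drbdadm adjust r0' on both nodes after editing.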
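As for the split-brain already on screen: without fencing it has to be
resolved by hand, by picking one node as the split-brain victim and
discarding its changes. A sketch, assuming drbd01 (the node that was
rebooted and failed over from) is the one whose data can be thrown
away:

root@drbd01:~# drbdadm disconnect r0
root@drbd01:~# drbdadm secondary r0
root@drbd01:~# drbdadm connect --discard-my-data r0

If drbd02 also dropped to StandAlone, run 'drbdadm connect r0' there as
well; the nodes should then reconnect and resync.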
-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?