Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 13/11/15 11:12 AM, Waldemar Brodkorb wrote:
> Hi,
>
> I have been struggling with a problem for two days and have found no
> solution yet. I think it might be something trivially simple that I am
> overlooking.
>
> I have two fresh Ubuntu 14.04.3 systems installed in Qemu. (I can
> provide the disk images on request if anyone needs them to reproduce
> the problem.)
>
> The following software is installed:
> drbd8-utils 2:8.4.4-1ubuntu1
> pacemaker 1.1.10+git20130802-1ubuntu2.3
> corosync 2.3.3-1ubuntu1
>
> I am using the LTS trusty kernel 3.13.0-68-generic.
> The drbd initscript is disabled. (update-rc.d -f drbd remove).
>
> I have the attached corosync.conf on both nodes.
> My DRBD resource r0 looks like:
> resource r0 {
>     device /dev/drbd0 minor 0;
>     disk /dev/sdb1;
>     meta-disk internal;
>     on drbd01 {
>         address 10.20.42.71:7780;
>     }
>     on drbd02 {
>         address 10.20.42.72:7780;
>     }
> }
>
> I haven't changed anything in /etc/drbd.d/global_common.conf.
>
> My CRM configuration is simple and nearly the same as the example in
> the DRBD manual, just without the MySQL resource:
> node $id="169093703" drbd01
> node $id="169093704" drbd02
> primitive p_drbd ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op monitor interval="29s" role="Master" \
>     op monitor interval="31s" role="Slave"
> primitive p_filesystem ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/drbd" fstype="ext4"
> primitive p_sharedip ocf:heartbeat:IPaddr2 \
>     params ip="10.20.42.70" nic="eth0"
> group grp_drbd p_filesystem p_sharedip
> ms ms_drbd p_drbd \
>     meta master-max="1" master-node-max="1" clone-max="2" \
>     clone-node-max="1" notify="true"
> colocation ip_on_drbd inf: grp_drbd ms_drbd:Master
> order ip_after_drbd inf: ms_drbd:promote grp_drbd:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
And here's the core of the problem.
Configure and test stonith in Pacemaker. Then configure DRBD to use
'fencing resource-and-stonith;' and set 'crm-fence-peer.sh' and
'crm-unfence-peer.sh' as the fence and unfence handlers, as sketched
below.
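For a Qemu/libvirt test cluster like yours, fence_virsh (from the
fence-agents package) works well. A rough sketch only, untested here;
the hypervisor address, login, ssh key and domain names below are
assumptions you will need to adjust to your environment:

    # one fence device per node; pcmk_host_list ties the device to
    # its target, and the -inf location keeps each device from
    # running on the very node it is meant to fence
    primitive st_drbd01 stonith:fence_virsh \
        params ipaddr="192.168.122.1" login="root" \
        identity_file="/root/.ssh/id_rsa" port="drbd01" \
        pcmk_host_list="drbd01" \
        op monitor interval="60s"
    primitive st_drbd02 stonith:fence_virsh \
        params ipaddr="192.168.122.1" login="root" \
        identity_file="/root/.ssh/id_rsa" port="drbd02" \
        pcmk_host_list="drbd02" \
        op monitor interval="60s"
    location loc_st_drbd01 st_drbd01 -inf: drbd01
    location loc_st_drbd02 st_drbd02 -inf: drbd02
    # and flip this back on
    property stonith-enabled="true"

Then hook DRBD into it (8.4 syntax, e.g. in
/etc/drbd.d/global_common.conf; the handler paths are where drbd8-utils
installs them on Ubuntu):

    common {
        disk {
            # on loss of the peer, block I/O and call the
            # fence-peer handler instead of carrying on alone
            fencing resource-and-stonith;
        }
        handlers {
            # places a constraint in the CIB so the stale peer
            # cannot be promoted until it has resynced
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }

Test it by crashing a node ('echo c > /proc/sysrq-trigger' works well)
and watching the survivor fence it before promoting.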
>     no-quorum-policy="ignore"
>
> Everything looks good to me in crm_mon:
> Last updated: Fri Nov 13 17:00:40 2015
> Last change: Fri Nov 13 16:37:39 2015 via cibadmin on drbd01
> Stack: corosync
> Current DC: drbd01 (169093703) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 4 Resources configured
>
>
> Online: [ drbd01 drbd02 ]
>
> Master/Slave Set: ms_drbd [p_drbd]
>     Masters: [ drbd01 ]
>     Slaves: [ drbd02 ]
> Resource Group: grp_drbd
>     p_filesystem (ocf::heartbeat:Filesystem): Started drbd01
>     p_sharedip (ocf::heartbeat:IPaddr2): Started drbd01
>
> The DRBD resource is fine, too:
> root@drbd01:~# cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> srcversion: 6551AD2C98F533733BE558C
> 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
> ns:4096 nr:0 dw:4 dr:4841 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
>
> I then reboot drbd01 and the failover works great:
> Last updated: Fri Nov 13 17:02:32 2015
> Last change: Fri Nov 13 16:37:39 2015 via cibadmin on drbd01
> Stack: corosync
> Current DC: drbd02 (169093704) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 4 Resources configured
>
>
> Online: [ drbd01 drbd02 ]
>
> Master/Slave Set: ms_drbd [p_drbd]
>     Masters: [ drbd02 ]
>     Slaves: [ drbd01 ]
> Resource Group: grp_drbd
>     p_filesystem (ocf::heartbeat:Filesystem): Started drbd02
>     p_sharedip (ocf::heartbeat:IPaddr2): Started drbd02
>
> Everything looks fine from the CRM perspective.
>
> But when I log back into drbd01, I see an unresolved split-brain:
> cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> srcversion: 6551AD2C98F533733BE558C
> 0: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown r-----
> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4096
>
> With the following dmesg output:
> [ 7.430374] drbd: initialized. Version: 8.4.3 (api:1/proto:86-101)
> [ 7.430376] drbd: srcversion: 6551AD2C98F533733BE558C
> [ 7.430377] drbd: registered as block device major 147
> [ 7.468725] d-con r0: Starting worker thread (from drbdsetup [970])
> [ 7.469322] block drbd0: disk( Diskless -> Attaching )
> [ 7.469426] d-con r0: Method to ensure write ordering: flush
> [ 7.469428] block drbd0: max BIO size = 1048576
> [ 7.469432] block drbd0: drbd_bm_resize called with capacity == 4192056
> [ 7.469440] block drbd0: resync bitmap: bits=524007 words=8188 pages=16
> [ 7.469442] block drbd0: size = 2047 MB (2096028 KB)
> [ 7.469976] block drbd0: bitmap READ of 16 pages took 0 jiffies
> [ 7.469986] block drbd0: recounting of set bits took additional 0 jiffies
> [ 7.469987] block drbd0: 4096 KB (1024 bits) marked out-of-sync by on disk bit-map.
> [ 7.470001] block drbd0: disk( Attaching -> UpToDate )
> [ 7.470003] block drbd0: attached to UUIDs 44F1F08DBF5F3F59:4EAEF009CE66D739:AF01AF11C6E607E8:AF00AF11C6E607E8
> [ 7.477742] d-con r0: conn( StandAlone -> Unconnected )
> [ 7.477753] d-con r0: Starting receiver thread (from drbd_w_r0 [971])
> [ 7.478619] d-con r0: receiver (re)started
> [ 7.478627] d-con r0: conn( Unconnected -> WFConnection )
> [ 7.979066] d-con r0: Handshake successful: Agreed network protocol version 101
> [ 7.979150] d-con r0: conn( WFConnection -> WFReportParams )
> [ 7.979152] d-con r0: Starting asender thread (from drbd_r_r0 [980])
> [ 7.979342] block drbd0: drbd_sync_handshake:
> [ 7.979345] block drbd0: self 44F1F08DBF5F3F58:4EAEF009CE66D739:AF01AF11C6E607E8:AF00AF11C6E607E8 bits:1024 flags:0
> [ 7.979347] block drbd0: peer 263D532088F42DC9:4EAEF009CE66D738:AF01AF11C6E607E8:AF00AF11C6E607E8 bits:1 flags:0
> [ 7.979349] block drbd0: uuid_compare()=100 by rule 90
> [ 7.979351] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
> [ 7.980176] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
> [ 7.980186] block drbd0: Split-Brain detected but unresolved, dropping connection!
> [ 7.980502] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
> [ 7.981054] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> [ 7.981070] d-con r0: conn( WFReportParams -> Disconnecting )
> [ 7.981072] d-con r0: error receiving ReportState, e: -5 l: 0!
> [ 7.981272] d-con r0: asender terminated
> [ 7.981273] d-con r0: Terminating drbd_a_r0
> [ 7.981410] d-con r0: Connection closed
> [ 7.981416] d-con r0: conn( Disconnecting -> StandAlone )
> [ 7.981417] d-con r0: receiver terminated
> [ 7.981418] d-con r0: Terminating drbd_r_r0
>
>
> Is this the expected behavior when no fencing or stonith is enabled
> in my two-node cluster?
>
> I have seen this posting, but its advice didn't solve my problem.
>
> http://serverfault.com/questions/663106/split-brain-on-drbd-and-pacemaker-cluster
>
> best regards
> Waldemar
>
> _______________________________________________
> drbd-user mailing list
> drbd-user@lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
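To answer your question directly: yes, with stonith disabled this is
exactly what you should expect in a two-node cluster. In the meantime,
you can resolve the existing split-brain by hand. This is the standard
DRBD 8.4 procedure; I am assuming drbd02 holds the data you want to
keep, as your crm_mon output suggests:

    # on drbd01, the node whose changes will be thrown away:
    drbdadm secondary r0
    drbdadm connect --discard-my-data r0

    # on drbd02, only if it also dropped to StandAlone:
    drbdadm connect r0

Without fencing, though, it will simply split-brain again on the next
failover, so fix the fencing first.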
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?