Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 13/11/15 11:12 AM, Waldemar Brodkorb wrote:
> Hi,
>
> I have been struggling with a problem for two days and have found no
> solution yet. I think it might be something trivially simple that I am
> overlooking.
>
> I have two fresh Ubuntu 14.04.3 systems installed in Qemu. (I can
> provide the disk images on request if anyone needs them to reproduce
> the problem.)
>
> The following software is installed:
> drbd8-utils 2:8.4.4-1ubuntu1
> pacemaker 1.1.10+git20130802-1ubuntu2.3
> corosync 2.3.3-1ubuntu1
>
> I am using the LTS trusty kernel 3.13.0-68-generic.
> The drbd initscript is disabled. (update-rc.d -f drbd remove).
>
> I have the attached corosync.conf on both nodes.
> My DRBD resource r0 looks like:
> resource r0 {
>     device /dev/drbd0 minor 0;
>     disk /dev/sdb1;
>     meta-disk internal;
>     on drbd01 {
>         address 10.20.42.71:7780;
>     }
>     on drbd02 {
>         address 10.20.42.72:7780;
>     }
> }
>
> I haven't changed anything in /etc/drbd.d/global_common.conf.
>
> My CRM configuration is simple and nearly the same as the example in
> the DRBD manual, just without the MySQL resource:
> node $id="169093703" drbd01
> node $id="169093704" drbd02
> primitive p_drbd ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op monitor interval="29s" role="Master" \
>     op monitor interval="31s" role="Slave"
> primitive p_filesystem ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/drbd" fstype="ext4"
> primitive p_sharedip ocf:heartbeat:IPaddr2 \
>     params ip="10.20.42.70" nic="eth0"
> group grp_drbd p_filesystem p_sharedip
> ms ms_drbd p_drbd \
>     meta master-max="1" master-node-max="1" clone-max="2" \
>     clone-node-max="1" notify="true"
> colocation ip_on_drbd inf: grp_drbd ms_drbd:Master
> order ip_after_drbd inf: ms_drbd:promote grp_drbd:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
And here's the core of the problem.
Configure and test stonith in Pacemaker. Then configure DRBD to use
'fencing resource-and-stonith;' and set 'crm-fence-peer.sh' and
'crm-unfence-peer.sh' as the fence and unfence handlers, as sketched
below.
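For a Qemu/libvirt test cluster like yours, fence_virsh (from the
fence-agents package) works well. A rough sketch only, untested here;
the hypervisor address, login, ssh key and domain names below are
assumptions you will need to adjust to your environment:

    # one fence device per node; pcmk_host_list ties the device to
    # its target, and the -inf location keeps each device from
    # running on the very node it is meant to fence
    primitive st_drbd01 stonith:fence_virsh \
        params ipaddr="192.168.122.1" login="root" \
        identity_file="/root/.ssh/id_rsa" port="drbd01" \
        pcmk_host_list="drbd01" \
        op monitor interval="60s"
    primitive st_drbd02 stonith:fence_virsh \
        params ipaddr="192.168.122.1" login="root" \
        identity_file="/root/.ssh/id_rsa" port="drbd02" \
        pcmk_host_list="drbd02" \
        op monitor interval="60s"
    location loc_st_drbd01 st_drbd01 -inf: drbd01
    location loc_st_drbd02 st_drbd02 -inf: drbd02
    # and flip this back on
    property stonith-enabled="true"

Then hook DRBD into it (8.4 syntax, e.g. in
/etc/drbd.d/global_common.conf; the handler paths are where drbd8-utils
installs them on Ubuntu):

    common {
        disk {
            # on loss of the peer, block I/O and call the
            # fence-peer handler instead of carrying on alone
            fencing resource-and-stonith;
        }
        handlers {
            # places a constraint in the CIB so the stale peer
            # cannot be promoted until it has resynced
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }

Test it by crashing a node ('echo c > /proc/sysrq-trigger' works well)
and watching the survivor fence it before promoting.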
>     no-quorum-policy="ignore"
>
> Everything looks good to me in crm_mon:
> Last updated: Fri Nov 13 17:00:40 2015
> Last change: Fri Nov 13 16:37:39 2015 via cibadmin on drbd01
> Stack: corosync
> Current DC: drbd01 (169093703) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 4 Resources configured
>
>
> Online: [ drbd01 drbd02 ]
>
> Master/Slave Set: ms_drbd [p_drbd]
>     Masters: [ drbd01 ]
>     Slaves: [ drbd02 ]
> Resource Group: grp_drbd
>     p_filesystem (ocf::heartbeat:Filesystem): Started drbd01
>     p_sharedip (ocf::heartbeat:IPaddr2): Started drbd01
>
> The DRBD resource is fine, too:
> root@drbd01:~# cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> srcversion: 6551AD2C98F533733BE558C
> 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
> ns:4096 nr:0 dw:4 dr:4841 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
>
> I then reboot drbd01 and the failover works great:
> Last updated: Fri Nov 13 17:02:32 2015
> Last change: Fri Nov 13 16:37:39 2015 via cibadmin on drbd01
> Stack: corosync
> Current DC: drbd02 (169093704) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 4 Resources configured
>
>
> Online: [ drbd01 drbd02 ]
>
> Master/Slave Set: ms_drbd [p_drbd]
>     Masters: [ drbd02 ]
>     Slaves: [ drbd01 ]
> Resource Group: grp_drbd
>     p_filesystem (ocf::heartbeat:Filesystem): Started drbd02
>     p_sharedip (ocf::heartbeat:IPaddr2): Started drbd02
>
> Everything looks fine from the CRM perspective.
>
> But when I log back into drbd01, I see an unresolved split-brain:
> cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> srcversion: 6551AD2C98F533733BE558C
> 0: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown r-----
> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4096
>
> With the following dmesg output:
> [ 7.430374] drbd: initialized. Version: 8.4.3 (api:1/proto:86-101)
> [ 7.430376] drbd: srcversion: 6551AD2C98F533733BE558C
> [ 7.430377] drbd: registered as block device major 147
> [ 7.468725] d-con r0: Starting worker thread (from drbdsetup [970])
> [ 7.469322] block drbd0: disk( Diskless -> Attaching )
> [ 7.469426] d-con r0: Method to ensure write ordering: flush
> [ 7.469428] block drbd0: max BIO size = 1048576
> [ 7.469432] block drbd0: drbd_bm_resize called with capacity == 4192056
> [ 7.469440] block drbd0: resync bitmap: bits=524007 words=8188 pages=16
> [ 7.469442] block drbd0: size = 2047 MB (2096028 KB)
> [ 7.469976] block drbd0: bitmap READ of 16 pages took 0 jiffies
> [ 7.469986] block drbd0: recounting of set bits took additional 0 jiffies
> [ 7.469987] block drbd0: 4096 KB (1024 bits) marked out-of-sync by on disk bit-map.
> [ 7.470001] block drbd0: disk( Attaching -> UpToDate )
> [ 7.470003] block drbd0: attached to UUIDs 44F1F08DBF5F3F59:4EAEF009CE66D739:AF01AF11C6E607E8:AF00AF11C6E607E8
> [ 7.477742] d-con r0: conn( StandAlone -> Unconnected )
> [ 7.477753] d-con r0: Starting receiver thread (from drbd_w_r0 [971])
> [ 7.478619] d-con r0: receiver (re)started
> [ 7.478627] d-con r0: conn( Unconnected -> WFConnection )
> [ 7.979066] d-con r0: Handshake successful: Agreed network protocol version 101
> [ 7.979150] d-con r0: conn( WFConnection -> WFReportParams )
> [ 7.979152] d-con r0: Starting asender thread (from drbd_r_r0 [980])
> [ 7.979342] block drbd0: drbd_sync_handshake:
> [ 7.979345] block drbd0: self 44F1F08DBF5F3F58:4EAEF009CE66D739:AF01AF11C6E607E8:AF00AF11C6E607E8 bits:1024 flags:0
> [ 7.979347] block drbd0: peer 263D532088F42DC9:4EAEF009CE66D738:AF01AF11C6E607E8:AF00AF11C6E607E8 bits:1 flags:0
> [ 7.979349] block drbd0: uuid_compare()=100 by rule 90
> [ 7.979351] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
> [ 7.980176] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
> [ 7.980186] block drbd0: Split-Brain detected but unresolved, dropping connection!
> [ 7.980502] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
> [ 7.981054] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> [ 7.981070] d-con r0: conn( WFReportParams -> Disconnecting )
> [ 7.981072] d-con r0: error receiving ReportState, e: -5 l: 0!
> [ 7.981272] d-con r0: asender terminated
> [ 7.981273] d-con r0: Terminating drbd_a_r0
> [ 7.981410] d-con r0: Connection closed
> [ 7.981416] d-con r0: conn( Disconnecting -> StandAlone )
> [ 7.981417] d-con r0: receiver terminated
> [ 7.981418] d-con r0: Terminating drbd_r_r0
>
>
> Is this the expected behavior when no fencing or stonith is enabled
> in my two-node cluster?
>
> I have seen this posting, but its advice didn't solve my problem.
>
> http://serverfault.com/questions/663106/split-brain-on-drbd-and-pacemaker-cluster
>
> best regards
> Waldemar
>
> _______________________________________________
> drbd-user mailing list
> drbd-user@lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
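To answer your question directly: yes, with stonith disabled this is
exactly what you should expect in a two-node cluster. In the meantime,
you can resolve the existing split-brain by hand. This is the standard
DRBD 8.4 procedure; I am assuming drbd02 holds the data you want to
keep, as your crm_mon output suggests:

    # on drbd01, the node whose changes will be thrown away:
    drbdadm secondary r0
    drbdadm connect --discard-my-data r0

    # on drbd02, only if it also dropped to StandAlone:
    drbdadm connect r0

Without fencing, though, it will simply split-brain again on the next
failover, so fix the fencing first.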
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?