[DRBD-user] Occasional split-brain at boot time
digimer
lists at alteeve.ca
Wed Apr 17 21:46:56 CEST 2019
On 2019-04-17 12:20 p.m., JCA wrote:
> I have a two-node cluster, in the form of two CentOS 7 VMs, as follows:
>
> Cluster name: ClusterOne
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with
> quorum
> Last updated: Wed Apr 17 09:43:42 2019
> Last change: Wed Apr 17 09:39:52 2019 by root via cibadmin on one
>
> 2 nodes configured
> 4 resources configured
>
> Online: [ one two ]
>
> Full list of resources:
>
> MyAppCluster    (ocf::myapps:MyApp):    Started one
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Masters: [ one ]
>      Slaves: [ two ]
> DrbdFS    (ocf::heartbeat:Filesystem):    Started one
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> The DRBD software that I am using is the following:
>
> drbd84-utils.x86_64 9.6.0-1.el7.elrepo @elrepo
> kmod-drbd84.x86_64 8.4.11-1.1.el7_6.elrepo @elrepo
>
> The nodes have been configured to share an ext4 partition, DrbdFS has
> been configured to start before MyAppCluster, and the ClusterOne
> cluster has been configured to start automatically at boot time.
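With pcs on CentOS 7, the ordering and colocation just described typically
look roughly like the sketch below; the resource names are taken from the
status output above, while the exact actions and INFINITY scores are
assumptions:

    # DRBD must be promoted before, and on the same node as, the filesystem
    pcs constraint order promote DrbdDataClone then start DrbdFS
    pcs constraint colocation add DrbdFS with master DrbdDataClone INFINITY
    # the application follows the mounted filesystem
    pcs constraint order DrbdFS then MyAppCluster
    pcs constraint colocation add MyAppCluster with DrbdFS INFINITY
    # start corosync/pacemaker automatically at boot on both nodes
    pcs cluster enable --all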
>
> The setup above works: node two takes over from node one when the
> latter becomes unreachable, and in that situation the DrbdFS filesystem
> automatically becomes available to node two at the correct mount point.
>
> Now when I reboot one and two, occasionally - but often enough to make
> me feel uneasy - DrbdFS comes up in a split-brain condition. What
> follows are the boot time syslog traces I typically get in such a case:
>
> Apr 17 09:35:59 one pengine[3663]: notice: * Start ClusterOne
> ( one )
> Apr 17 09:35:59 one pengine[3663]: notice: * Start DrbdFS
> ( one )
> Apr 17 09:35:59 one pengine[3663]: notice: Calculated transition 4,
> saving inputs in /var/lib/pacemaker/pengine/pe-input-560.bz2
> Apr 17 09:35:59 one crmd[3664]: notice: Initiating monitor operation
> DrbdData_monitor_30000 on two
> Apr 17 09:35:59 one crmd[3664]: notice: Initiating start operation
> DrbdFS_start_0 locally on one
> Apr 17 09:35:59 one kernel: drbd myapp-data: Handshake successful:
> Agreed network protocol version 101
> Apr 17 09:35:59 one kernel: drbd myapp-data: Feature flags enabled on
> protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
> Apr 17 09:35:59 one kernel: drbd myapp-data: conn( WFConnection ->
> WFReportParams )
> Apr 17 09:35:59 one kernel: drbd myapp-data: Starting ack_recv thread
> (from drbd_r_myapp-dat [4406])
> Apr 17 09:35:59 one kernel: block drbd1: drbd_sync_handshake:
> Apr 17 09:35:59 one kernel: block drbd1: self
> 002DDA8B166FC899:8DD977B102052FD2:
> BE3891694D7BCD54:BE3791694D7BCD54 bits:0 flags:0
> Apr 17 09:35:59 one kernel: block drbd1: peer
> D12D1947C4ECF940:8DD977B102052FD2:
> BE3891694D7BCD54:BE3791694D7BCD54 bits:32 flags:0
> Apr 17 09:35:59 one kernel: block drbd1: uuid_compare()=100 by rule 90
> Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm
> initial-split-brain minor-1
> Apr 17 09:35:59 one Filesystem(DrbdFS)[4531]: INFO: Running start for
> /dev/drbd1 on /var/lib/myapp
> Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm
> initial-split-brain minor-1 exit code 0 (0x0)
> Apr 17 09:35:59 one kernel: block drbd1: Split-Brain detected but
> unresolved, dropping connection!
> Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm
> split-brain minor-1
> Apr 17 09:35:59 one kernel: drbd myapp-data: meta connection shut down
> by peer.
> Apr 17 09:35:59 one kernel: drbd myapp-data: conn( WFReportParams ->
> NetworkFailure )
> Apr 17 09:35:59 one kernel: drbd myapp-data: ack_receiver terminated
> Apr 17 09:35:59 one kernel: drbd myapp-data: Terminating drbd_a_myapp-dat
> Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm
> split-brain minor-1 exit code 0 (0x0)
>
> Fixing the problem by manual intervention, once both nodes are up and
> running, is not difficult. However, I would like to understand why the
> split-brain condition sometimes arises at boot time and, more
> importantly, how to prevent it from happening, if at all possible.
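For reference, the usual manual recovery with DRBD 8.4 is roughly the
following, choosing one node as the split-brain victim whose changes are
discarded (the resource name myapp-data comes from the logs above):

    # on the victim, whose local changes will be thrown away
    drbdadm disconnect myapp-data
    drbdadm secondary myapp-data
    drbdadm connect --discard-my-data myapp-data

    # on the survivor, reconnect if it has dropped to StandAlone
    drbdadm connect myapp-data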
>
> Suggestions?
Stonith in pacemaker and, once that is tested, fencing in DRBD. This is
what fencing is for.
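Concretely, that means a working stonith device in pacemaker plus DRBD-level
fencing wired into the cluster. A minimal sketch for DRBD 8.4 follows; the
fence_xvm agent and its options are only placeholders for two libvirt VMs
and will differ on other hypervisors:

    # pacemaker stonith devices, one per node
    pcs stonith create fence_one fence_xvm port="one" pcmk_host_list="one"
    pcs stonith create fence_two fence_xvm port="two" pcmk_host_list="two"
    pcs property set stonith-enabled=true

    # in the DRBD resource file, e.g. /etc/drbd.d/myapp-data.res
    resource myapp-data {
        disk {
            fencing resource-and-stonith;
        }
        handlers {
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        # existing device/disk/network sections stay as they are
    }

With resource-and-stonith, DRBD suspends I/O on connection loss and calls the
fence-peer handler, so a disconnected peer cannot quietly be promoted with
diverging data, which is how the split-brain arises.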
digimer