[DRBD-user] Occasional split-brain at boot time

JCA 1.41421 at gmail.com
Wed Apr 17 18:20:44 CEST 2019


I have a two-node cluster, in the form of two CentOS 7 VMs, as follows:

Cluster name: ClusterOne
Stack: corosync
Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Wed Apr 17 09:43:42 2019
Last change: Wed Apr 17 09:39:52 2019 by root via cibadmin on one

2 nodes configured
4 resources configured

Online: [ one two ]

Full list of resources:

 MyAppCluster (ocf::myapps:MyApp): Started one
 Master/Slave Set: DrbdDataClone [DrbdData]
     Masters: [ one ]
     Slaves: [ two ]
 DrbdFS (ocf::heartbeat:Filesystem): Started one

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

The DRBD software that I am using is the following:

drbd84-utils.x86_64     9.6.0-1.el7.elrepo        @elrepo
kmod-drbd84.x86_64      8.4.11-1.1.el7_6.elrepo   @elrepo

The two nodes replicate an ext4 partition over DRBD, DrbdFS has been
configured to start before MyAppCluster, and the ClusterOne cluster has
been configured to start automatically at boot time.
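
For completeness, the relevant constraints and the boot-time autostart were
set up roughly as follows (quoted from memory, so the exact pcs invocations
may not be verbatim):

  # DrbdFS may only run where the DRBD resource is Master, and only after promotion
  pcs constraint colocation add DrbdFS with DrbdDataClone INFINITY with-rsc-role=Master
  pcs constraint order promote DrbdDataClone then start DrbdFS
  # MyAppCluster starts only once its filesystem is mounted
  pcs constraint order DrbdFS then MyAppCluster
  # start corosync/pacemaker automatically at boot on both nodes
  pcs cluster enable --all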

The setup works as intended: when node one becomes unreachable, node two
takes over, and the DrbdFS filesystem automatically becomes available on
node two at the correct mount point.

Now, when I reboot one and two, occasionally - but often enough to make me
uneasy - DrbdFS comes up in a split-brain condition. Below are the
boot-time syslog traces I typically get in such a case:

Apr 17 09:35:59 one pengine[3663]:  notice:  * Start      ClusterOne   ( one )
Apr 17 09:35:59 one pengine[3663]:  notice:  * Start      DrbdFS       ( one )
Apr 17 09:35:59 one pengine[3663]:  notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-560.bz2
Apr 17 09:35:59 one crmd[3664]:  notice: Initiating monitor operation DrbdData_monitor_30000 on two
Apr 17 09:35:59 one crmd[3664]:  notice: Initiating start operation DrbdFS_start_0 locally on one
Apr 17 09:35:59 one kernel: drbd myapp-data: Handshake successful: Agreed network protocol version 101
Apr 17 09:35:59 one kernel: drbd myapp-data: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Apr 17 09:35:59 one kernel: drbd myapp-data: conn( WFConnection -> WFReportParams )
Apr 17 09:35:59 one kernel: drbd myapp-data: Starting ack_recv thread (from drbd_r_myapp-dat [4406])
Apr 17 09:35:59 one kernel: block drbd1: drbd_sync_handshake:
Apr 17 09:35:59 one kernel: block drbd1: self 002DDA8B166FC899:8DD977B102052FD2:BE3891694D7BCD54:BE3791694D7BCD54 bits:0 flags:0
Apr 17 09:35:59 one kernel: block drbd1: peer D12D1947C4ECF940:8DD977B102052FD2:BE3891694D7BCD54:BE3791694D7BCD54 bits:32 flags:0
Apr 17 09:35:59 one kernel: block drbd1: uuid_compare()=100 by rule 90
Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Apr 17 09:35:59 one Filesystem(DrbdFS)[4531]: INFO: Running start for /dev/drbd1 on /var/lib/myapp
Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Apr 17 09:35:59 one kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Apr 17 09:35:59 one kernel: drbd myapp-data: meta connection shut down by peer.
Apr 17 09:35:59 one kernel: drbd myapp-data: conn( WFReportParams -> NetworkFailure )
Apr 17 09:35:59 one kernel: drbd myapp-data: ack_receiver terminated
Apr 17 09:35:59 one kernel: drbd myapp-data: Terminating drbd_a_myapp-dat
Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)

   Fixing the problem by manual intervention, once both nodes are up and
running, is not difficult. However, I would like to understand why the
split-brain condition sometimes arises at boot time and, more importantly,
how to prevent it from happening, if at all possible.
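
For context, the manual fix is essentially the standard DRBD 8.4
split-brain recovery, roughly along these lines (which node's data gets
discarded depends on the situation; myapp-data is the resource name from
the logs above):

  # on the node whose changes are to be thrown away (the split-brain "victim")
  drbdadm secondary myapp-data
  drbdadm connect --discard-my-data myapp-data

  # on the surviving node, if its connection has dropped to StandAlone
  drbdadm connect myapp-data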

   Suggestions?