[DRBD-user] Occasional split-brain at boot time
digimer
lists at alteeve.ca
Wed Apr 17 21:46:56 CEST 2019
On 2019-04-17 12:20 p.m., JCA wrote:
> I have a two-node cluster, in the form of two CentOS 7 VMs, as follows:
>
> Cluster name: ClusterOne
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with
> quorum
> Last updated: Wed Apr 17 09:43:42 2019
> Last change: Wed Apr 17 09:39:52 2019 by root via cibadmin on one
>
> 2 nodes configured
> 4 resources configured
>
> Online: [ one two ]
>
> Full list of resources:
>
> MyAppCluster    (ocf::myapps:MyApp):    Started one
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Masters: [ one ]
>      Slaves: [ two ]
> DrbdFS    (ocf::heartbeat:Filesystem):    Started one
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> The DRBD software that I am using is the following:
>
> drbd84-utils.x86_64 9.6.0-1.el7.elrepo @elrepo
> kmod-drbd84.x86_64 8.4.11-1.1.el7_6.elrepo @elrepo
>
> The nodes have been configured to share an ext4 partition, DrbdFS has
> been configured to start before MyAppCluster, and the ClusterOne
> cluster has been configured to start automatically at boot time.
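With pcs on CentOS 7, the ordering and colocation just described typically
look roughly like the sketch below; the resource names are taken from the
status output above, while the exact actions and INFINITY scores are
assumptions:

    # DRBD must be promoted before, and on the same node as, the filesystem
    pcs constraint order promote DrbdDataClone then start DrbdFS
    pcs constraint colocation add DrbdFS with master DrbdDataClone INFINITY
    # the application follows the mounted filesystem
    pcs constraint order DrbdFS then MyAppCluster
    pcs constraint colocation add MyAppCluster with DrbdFS INFINITY
    # start corosync/pacemaker automatically at boot on both nodes
    pcs cluster enable --all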
>
> The setup above works: node two takes over from node one when the
> latter becomes unreachable, and in that situation the DrbdFS filesystem
> automatically becomes available to node two at the correct mount point.
>
> Now when I reboot one and two, occasionally - but often enough to make
> me feel uneasy - DrbdFS comes up in a split-brain condition. What
> follows are the boot time syslog traces I typically get in such a case:
>
> Apr 17 09:35:59 one pengine[3663]: notice: * Start ClusterOne
> ( one )
> Apr 17 09:35:59 one pengine[3663]: notice: * Start DrbdFS
> ( one )
> Apr 17 09:35:59 one pengine[3663]: notice: Calculated transition 4,
> saving inputs in /var/lib/pacemaker/pengine/pe-input-560.bz2
> Apr 17 09:35:59 one crmd[3664]: notice: Initiating monitor operation
> DrbdData_monitor_30000 on two
> Apr 17 09:35:59 one crmd[3664]: notice: Initiating start operation
> DrbdFS_start_0 locally on one
> Apr 17 09:35:59 one kernel: drbd myapp-data: Handshake successful:
> Agreed network protocol version 101
> Apr 17 09:35:59 one kernel: drbd myapp-data: Feature flags enabled on
> protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
> Apr 17 09:35:59 one kernel: drbd myapp-data: conn( WFConnection ->
> WFReportParams )
> Apr 17 09:35:59 one kernel: drbd myapp-data: Starting ack_recv thread
> (from drbd_r_myapp-dat [4406])
> Apr 17 09:35:59 one kernel: block drbd1: drbd_sync_handshake:
> Apr 17 09:35:59 one kernel: block drbd1: self
> 002DDA8B166FC899:8DD977B102052FD2:
> BE3891694D7BCD54:BE3791694D7BCD54 bits:0 flags:0
> Apr 17 09:35:59 one kernel: block drbd1: peer
> D12D1947C4ECF940:8DD977B102052FD2:
> BE3891694D7BCD54:BE3791694D7BCD54 bits:32 flags:0
> Apr 17 09:35:59 one kernel: block drbd1: uuid_compare()=100 by rule 90
> Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm
> initial-split-brain minor-1
> Apr 17 09:35:59 one Filesystem(DrbdFS)[4531]: INFO: Running start for
> /dev/drbd1 on /var/lib/myapp
> Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm
> initial-split-brain minor-1 exit code 0 (0x0)
> Apr 17 09:35:59 one kernel: block drbd1: Split-Brain detected but
> unresolved, dropping connection!
> Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm
> split-brain minor-1
> Apr 17 09:35:59 one kernel: drbd myapp-data: meta connection shut down
> by peer.
> Apr 17 09:35:59 one kernel: drbd myapp-data: conn( WFReportParams ->
> NetworkFailure )
> Apr 17 09:35:59 one kernel: drbd myapp-data: ack_receiver terminated
> Apr 17 09:35:59 one kernel: drbd myapp-data: Terminating drbd_a_myapp-dat
> Apr 17 09:35:59 one kernel: block drbd1: helper command: /sbin/drbdadm
> split-brain minor-1 exit code 0 (0x0)
>
> Fixing the problem by manual intervention, once both nodes are up and
> running, is not difficult. However, I would like to understand why the
> split-brain condition sometimes arises at boot time and, more
> importantly, how to prevent it from happening, if at all possible.
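For reference, the usual manual recovery with DRBD 8.4 is roughly the
following, choosing one node as the split-brain victim whose changes are
discarded (the resource name myapp-data comes from the logs above):

    # on the victim, whose local changes will be thrown away
    drbdadm disconnect myapp-data
    drbdadm secondary myapp-data
    drbdadm connect --discard-my-data myapp-data

    # on the survivor, reconnect if it has dropped to StandAlone
    drbdadm connect myapp-data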
>
> Suggestions?
Stonith in pacemaker and, once that is tested, fencing in DRBD. This is
what fencing is for.
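Concretely, that means a working stonith device in pacemaker plus DRBD-level
fencing wired into the cluster. A minimal sketch for DRBD 8.4 follows; the
fence_xvm agent and its options are only placeholders for two libvirt VMs
and will differ on other hypervisors:

    # pacemaker stonith devices, one per node
    pcs stonith create fence_one fence_xvm port="one" pcmk_host_list="one"
    pcs stonith create fence_two fence_xvm port="two" pcmk_host_list="two"
    pcs property set stonith-enabled=true

    # in the DRBD resource file, e.g. /etc/drbd.d/myapp-data.res
    resource myapp-data {
        disk {
            fencing resource-and-stonith;
        }
        handlers {
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        # existing device/disk/network sections stay as they are
    }

With resource-and-stonith, DRBD suspends I/O on connection loss and calls the
fence-peer handler, so a disconnected peer cannot quietly be promoted with
diverging data, which is how the split-brain arises.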
digimer