[DRBD-user] Cluster split after short network outage

Veit Wahlich cru.lists at zodia.de
Thu Jul 12 16:04:22 CEST 2018


Hi Roman,

what you experienced is the expected behaviour of a primary-primary
setup when the nodes are disconnected from each other. It is called a
split-brain situation, and dropping the connection ensures that data
stays available/accessible on both sides without further corruption.

Usually you want to set up a STONITH configuration that performs a hard
shut-down or at least a network isolation of one of the hosts if such a
situation occurs, so the surviving side is free to restart the services
that resided on the other side before the split-brain occurred.
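
As a rough sketch of the DRBD side of such a setup, assuming DRBD 9 (as
in the version output quoted below) and a Pacemaker-managed cluster,
which the original post does not confirm: the resource name "dhcp" is
taken from the log, and the handler paths are the ones usually shipped
with drbd-utils and may differ on your distribution. The STONITH device
itself is configured in the cluster manager and is not shown here.

```
resource dhcp {
    net {
        # freeze I/O and call the fence-peer handler when the peer is lost
        fencing resource-and-stonith;
    }
    handlers {
        # assumed paths, check where your drbd-utils package installs them
        fence-peer   "/usr/lib/drbd/crm-fence-peer.9.sh";
        unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
    }
    # existing volume/connection definitions stay unchanged
}
```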

You also might want to set up redundant networking, especially when
running a primary-primary configuration.

To resolve the split-brain, you need to dismiss the data of one side by
forcing a resync with the other side as source. If you have data changes
on both sides, you might want to copy the changes from the side to be
discarded to the future source first, usually at file level, or in case
of a shared LVM-PV, on LV level.
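
The resync described above could look roughly like the following
sketch. It assumes the resource name "dhcp" from the log and that
everything using the device on the discarded side (mounts, services)
has been stopped first. DRY_RUN, run and the two function names are
hypothetical helpers for illustration only; with DRY_RUN=1 (the
default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of a manual split-brain recovery for one DRBD resource.
RES="${RES:-dhcp}"        # resource name, taken from the quoted log
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the side whose changes are to be discarded.
discard_local_data() {
    run drbdadm disconnect "$RES"
    run drbdadm secondary "$RES"
    run drbdadm connect --discard-my-data "$RES"
}

# Run on the side that keeps its data, if it went StandAlone as well.
reconnect_survivor() {
    run drbdadm disconnect "$RES"
    run drbdadm connect "$RES"
}
```

Call discard_local_data on the side to be discarded and
reconnect_survivor on the other side, then check the result with
"drbdadm status dhcp" on both nodes.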

You might also want to reconsider whether a primary-primary
configuration really suits your needs best.

Best regards,
// Veit

On Thursday, 12.07.2018, at 12:52 +0300, Roman Makhov wrote:
> Hello,
> 
> I discovered the "Cluster is now split" message in the log, followed
> by a move to StandAlone, after a short (about 8 seconds) network
> failure between the cluster nodes.
> 
> Could you please suggest something?
> 
> Thank you in advance!
> 
> The drbd log is:
> =====================================================================
> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: PingAck did not
> arrive in time.
> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: conn( Connected
> -> NetworkFailure ) peer( Primary -> Unknown )
> [Sat Jul  7 21:02:22 2018] drbd dhcp/0 drbd0: disk( UpToDate -> Consistent )
> [Sat Jul  7 21:02:22 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: pdsk(
> UpToDate -> DUnknown ) repl( Established -> Off )
> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: ack_receiver terminated
> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: Terminating
> ack_recv thread
> [Sat Jul  7 21:02:22 2018] drbd dhcp: Preparing cluster-wide state
> change 1083152536 (1->-1 0/0)
> [Sat Jul  7 21:02:22 2018] drbd dhcp: Committing cluster-wide state
> change 1083152536 (2ms)
> [Sat Jul  7 21:02:22 2018] drbd dhcp/0 drbd0: disk( Consistent -> UpToDate )
> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: Connection closed
> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: conn(
> NetworkFailure -> Unconnected )
> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: Restarting
> receiver thread
> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: conn(
> Unconnected -> Connecting )
> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Handshake to
> peer 0 successful: Agreed network protocol version 112
> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Feature flags
> enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Starting
> ack_recv thread (from drbd_r_dhcp [28952])
> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Preparing
> remote state change 1152846943 (primary_nodes=0, weak_nodes=0)
> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Committing
> remote state change 1152846943
> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: conn(
> Connecting -> Connected ) peer( Unknown -> Primary )
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0: disk( UpToDate -> Outdated )
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp:
> drbd_sync_handshake:
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: self
> 0967822D6718C8AC:0000000000000000:323BE7D71FABECCC:44CD99B02FF92950
> bits:0 flags:120
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp:
> uuid_compare()=-2 by rule 50
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: pdsk(
> DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: receive
> bitmap stats [Bytes(packets)]: plain 0(0), RLE 29(1), total 29;
> compression: 100.0%
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: send
> bitmap stats [Bytes(packets)]: plain 0(0), RLE 29(1), total 29;
> compression: 100.0%
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: helper
> command: /sbin/drbdadm before-resync-target
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: helper
> command: /sbin/drbdadm before-resync-target exit code 0 (0x0)
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0: disk( Outdated -> Inconsistent )
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: repl(
> WFBitMapT -> SyncTarget )
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: Began
> resync as SyncTarget (will sync 12 KB [3 bits set]).
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: Resync
> done (total 1 sec; paused 0 sec; 12 K/sec)
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: updated
> UUIDs 64F87B33597A7F68:0000000000000000:0967822D6718C8AC:323BE7D71FABECCC
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0: disk( Inconsistent -> UpToDate )
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: repl(
> SyncTarget -> Established )
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: helper
> command: /sbin/drbdadm after-resync-target
> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: helper
> command: /sbin/drbdadm after-resync-target exit code 0 (0x0)
> [Sat Jul  7 21:02:38 2018] drbd dhcp: Preparing cluster-wide state
> change 3007526931 (1->0 496/16)
> [Sat Jul  7 21:02:38 2018] drbd dhcp: State change 3007526931:
> primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFE
> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Cluster is now split
> [Sat Jul  7 21:02:38 2018] drbd dhcp: Committing cluster-wide state
> change 3007526931 (50ms)
> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: conn( Connected
> -> Disconnecting ) peer( Primary -> Unknown )
> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0: disk( UpToDate -> Outdated )
> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: pdsk(
> UpToDate -> DUnknown ) repl( Established -> Off )
> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: ack_receiver terminated
> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Terminating
> ack_recv thread
> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Connection closed
> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: conn(
> Disconnecting -> StandAlone )
> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Terminating
> receiver thread
> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Terminating sender thread
> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0: disk( Outdated -> Detaching )
> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0: disk( Detaching -> Diskless )
> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0: drbd_bm_resize called
> with capacity == 0
> [Sat Jul  7 21:02:38 2018] drbd dhcp: Terminating worker thread
> =====================================================================
> 
> The version is:
> =====================================================================
> [root@dhcp-slave ~]# drbdadm --version
> DRBDADM_BUILDTAG=GIT-hash:\ 3fc9d321f9f60e84d7dcdaac545fe8be4b280a63\
> build\ by\ mockbuild@\,\ 2017-09-14\ 17:52:30
> DRBDADM_API_VERSION=2
> DRBD_KERNEL_VERSION_CODE=0x090009
> DRBD_KERNEL_VERSION=9.0.9
> DRBDADM_VERSION_CODE=0x090100
> DRBDADM_VERSION=9.1.0
> =====================================================================
> 
> Roman.



