[DRBD-user] Cluster split after short network outage

Roman Makhov roman.makhov at gmail.com
Wed Jul 25 14:14:03 CEST 2018


Thanks again, Veit, for the detailed description.

I would like to record the root cause of the problem here.
The issue happened because DRBD was managed as a Pacemaker resource.
On the network failure, Pacemaker promoted the Slave to the Master role,
so we had two DRBD Primary nodes in parallel.
When the network was restored, we got the split-brain between the two
primaries that Veit described.
Automatic split-brain recovery helps to resolve this issue.
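
For the archive, here is a minimal sketch of how automatic split-brain
recovery can be enabled in the DRBD configuration. The resource name
"dhcp" is taken from the log below. The three policy values are only an
illustrative choice (an assumption, not a recommendation for every
setup), and the rest of the resource definition is omitted:
=====================================================================
resource dhcp {
  net {
    # Policies are selected by the number of Primary nodes at the
    # moment the split-brain is detected (values below are examples).
    after-sb-0pri discard-zero-changes;  # no Primary: keep the side that has changes
    after-sb-1pri discard-secondary;     # one Primary: discard the Secondary's data
    after-sb-2pri disconnect;            # two Primaries: no automatic recovery
  }
}
=====================================================================
Automatic recovery throws away the changes of one node, so the policies
should be chosen with care.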

Thanks,
Roman.

2018-07-12 17:04 GMT+03:00 Veit Wahlich <cru.lists at zodia.de>:
> Hi Roman,
>
> what you experienced is the expected behaviour of a primary-primary
> setup when the nodes are being disconnected from each other. It is
> called a split-brain situation and ensures that data stays
> available/accessible on both sides without further corruption.
>
> Usually you want to set up a STONITH configuration that performs a hard
> shut-down or at least a network isolation of one of the hosts if such a
> situation occurs, so the surviving side is free to restart the services
> that resided on the other side before the split-brain occurred.
>
> You also might want to set up redundant networking, especially when
> running a primary-primary configuration.
>
> To resolve the split-brain, you need to dismiss the data of one side by
> forcing a resync with the other side as source. If you have data changes
> on both sides, you might want to copy the changes from the side to be
> discarded to the future source first, usually at file level, or in case
> of a shared LVM-PV, on LV level.
>
> You might also want to reconsider whether a primary-primary
> configuration really suits your needs best.
>
> Best regards,
> // Veit
>
> On Thursday, 12.07.2018, at 12:52 +0300, Roman Makhov wrote:
>> Hello,
>>
>> I found the "Cluster is now split" message in the log, followed by a
>> transition to StandAlone, after a short (about 8 seconds) network
>> failure between the cluster nodes.
>>
>> Could you please suggest something?
>>
>> Thank you in advance!
>>
>> The drbd log is:
>> =====================================================================
>> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: PingAck did not
>> arrive in time.
>> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: conn( Connected
>> -> NetworkFailure ) peer( Primary -> Unknown )
>> [Sat Jul  7 21:02:22 2018] drbd dhcp/0 drbd0: disk( UpToDate -> Consistent )
>> [Sat Jul  7 21:02:22 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: pdsk(
>> UpToDate -> DUnknown ) repl( Established -> Off )
>> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: ack_receiver terminated
>> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: Terminating
>> ack_recv thread
>> [Sat Jul  7 21:02:22 2018] drbd dhcp: Preparing cluster-wide state
>> change 1083152536 (1->-1 0/0)
>> [Sat Jul  7 21:02:22 2018] drbd dhcp: Committing cluster-wide state
>> change 1083152536 (2ms)
>> [Sat Jul  7 21:02:22 2018] drbd dhcp/0 drbd0: disk( Consistent -> UpToDate )
>> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: Connection closed
>> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: conn(
>> NetworkFailure -> Unconnected )
>> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: Restarting
>> receiver thread
>> [Sat Jul  7 21:02:22 2018] drbd dhcp dhcp-master.dhcp: conn(
>> Unconnected -> Connecting )
>> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Handshake to
>> peer 0 successful: Agreed network protocol version 112
>> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Feature flags
>> enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
>> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Starting
>> ack_recv thread (from drbd_r_dhcp [28952])
>> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Preparing
>> remote state change 1152846943 (primary_nodes=0, weak_nodes=0)
>> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: Committing
>> remote state change 1152846943
>> [Sat Jul  7 21:02:30 2018] drbd dhcp dhcp-master.dhcp: conn(
>> Connecting -> Connected ) peer( Unknown -> Primary )
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0: disk( UpToDate -> Outdated )
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp:
>> drbd_sync_handshake:
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: self
>> 0967822D6718C8AC:0000000000000000:323BE7D71FABECCC:44CD99B02FF92950
>> bits:0 flags:120
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp:
>> uuid_compare()=-2 by rule 50
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: pdsk(
>> DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: receive
>> bitmap stats [Bytes(packets)]: plain 0(0), RLE 29(1), total 29;
>> compression: 100.0%
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: send
>> bitmap stats [Bytes(packets)]: plain 0(0), RLE 29(1), total 29;
>> compression: 100.0%
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: helper
>> command: /sbin/drbdadm before-resync-target
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: helper
>> command: /sbin/drbdadm before-resync-target exit code 0 (0x0)
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0: disk( Outdated -> Inconsistent )
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: repl(
>> WFBitMapT -> SyncTarget )
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: Began
>> resync as SyncTarget (will sync 12 KB [3 bits set]).
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: Resync
>> done (total 1 sec; paused 0 sec; 12 K/sec)
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: updated
>> UUIDs 64F87B33597A7F68:0000000000000000:0967822D6718C8AC:323BE7D71FABECCC
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0: disk( Inconsistent -> UpToDate )
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: repl(
>> SyncTarget -> Established )
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: helper
>> command: /sbin/drbdadm after-resync-target
>> [Sat Jul  7 21:02:30 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: helper
>> command: /sbin/drbdadm after-resync-target exit code 0 (0x0)
>> [Sat Jul  7 21:02:38 2018] drbd dhcp: Preparing cluster-wide state
>> change 3007526931 (1->0 496/16)
>> [Sat Jul  7 21:02:38 2018] drbd dhcp: State change 3007526931:
>> primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFE
>> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Cluster is now split
>> [Sat Jul  7 21:02:38 2018] drbd dhcp: Committing cluster-wide state
>> change 3007526931 (50ms)
>> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: conn( Connected
>> -> Disconnecting ) peer( Primary -> Unknown )
>> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0: disk( UpToDate -> Outdated )
>> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0 dhcp-master.dhcp: pdsk(
>> UpToDate -> DUnknown ) repl( Established -> Off )
>> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: ack_receiver terminated
>> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Terminating
>> ack_recv thread
>> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Connection closed
>> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: conn(
>> Disconnecting -> StandAlone )
>> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Terminating
>> receiver thread
>> [Sat Jul  7 21:02:38 2018] drbd dhcp dhcp-master.dhcp: Terminating sender thread
>> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0: disk( Outdated -> Detaching )
>> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0: disk( Detaching -> Diskless )
>> [Sat Jul  7 21:02:38 2018] drbd dhcp/0 drbd0: drbd_bm_resize called
>> with capacity == 0
>> [Sat Jul  7 21:02:38 2018] drbd dhcp: Terminating worker thread
>> =====================================================================
>>
>> The version is:
>> =====================================================================
>> [root@dhcp-slave ~]# drbdadm --version
>> DRBDADM_BUILDTAG=GIT-hash:\ 3fc9d321f9f60e84d7dcdaac545fe8be4b280a63\
>> build\ by\ mockbuild@\,\ 2017-09-14\ 17:52:30
>> DRBDADM_API_VERSION=2
>> DRBD_KERNEL_VERSION_CODE=0x090009
>> DRBD_KERNEL_VERSION=9.0.9
>> DRBDADM_VERSION_CODE=0x090100
>> DRBDADM_VERSION=9.1.0
>> =====================================================================
>>
>> Roman.
>
>

