[DRBD-user] Hang during cluster-wide state change

Matias Ezequiel Vara Larsen matiasevara at gmail.com
Thu Jun 9 14:46:24 CEST 2022


Hello, to complement the previous logs (from xosan1), I am attaching the kern.log
from xosan3; it may help in understanding the issue:

[27911.178915] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan1: meta connection shut down by peer.
[27911.178932] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan1: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[27911.178934] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan1: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[27911.178945] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan1: ack_receiver terminated


Matias

On Tue, Jun 07, 2022 at 04:31:49PM +0200, Matias Ezequiel Vara Larsen wrote:
> Hello,
> 
> I am observing the following behavior during a cluster-wide state change
> (kern.log):
> 
> [27050.375909] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 2579620917 (0->-1 3/1)
> [27080.486801] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state change 2579620917 (30112ms) rv = -23
> [27081.671615] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 3642399026 (0->-1 3/1)
> [27113.253606] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state change 3642399026 (31584ms) rv = -23
> 
> Here rv = -23 indicates SS_TIMEOUT. This continues for a while; then the error
> changes to rv = -19, which indicates SS_CONCURRENT_ST_CHG (both values are
> decoded in the sketch after the following log):
> 
> [27914.141426] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 3371883425 (0->-1 3/1)
> [27926.280093] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: sock_recvmsg returned -113
> [27926.280099] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )
> [27926.280101] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
> [27926.280167] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: ack_receiver terminated
> [27926.280168] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: Terminating ack_recv thread
> [27926.280189] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state change 3371883425 (12140ms) rv = -19
> [27926.280252] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: State change failed: In transient state, retry after next state change
> 
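> For reference, a minimal, compilable sketch that decodes the numbers seen
> above. The SS_* values are copied from my copy of the DRBD 9 headers
> (drbd-headers/linux/drbd.h) and may shift between versions, so treat them as
> an assumption and check your own tree; the -113 from sock_recvmsg is simply a
> negative errno:
> 
>     /* rv_decode.c -- decode the values from the kern.log excerpts.
>      * SS_* constants copied from a DRBD 9 drbd-headers/linux/drbd.h;
>      * they may differ in other DRBD versions. */
>     #include <stdio.h>
>     #include <string.h>
> 
>     #define SS_IN_TRANSIENT_STATE -18 /* "In transient state, retry after next state change" */
>     #define SS_CONCURRENT_ST_CHG  -19 /* concurrent cluster-wide state change detected */
>     #define SS_TIMEOUT            -23 /* peers did not answer the "prepare" in time */
> 
>     int main(void)
>     {
>         printf("rv = %d -> SS_TIMEOUT\n", SS_TIMEOUT);
>         printf("rv = %d -> SS_CONCURRENT_ST_CHG\n", SS_CONCURRENT_ST_CHG);
>         printf("rv = %d -> SS_IN_TRANSIENT_STATE\n", SS_IN_TRANSIENT_STATE);
>         /* sock_recvmsg returned -113: errno 113 is EHOSTUNREACH on Linux */
>         printf("sock_recvmsg -113 -> %s\n", strerror(113));
>         return 0;
>     }
> 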
> The state change then continues to fail, again with rv = -23:
> 
> [27926.328265] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 3899801398 (0->-1 3/1)
> [27926.329319] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: Connection closed
> [27926.329326] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: helper command: /sbin/drbdadm disconnected
> [27926.332926] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: helper command: /sbin/drbdadm disconnected exit code 0
> [27926.332955] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: conn( BrokenPipe -> Unconnected )
> [27926.332980] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: Restarting receiver thread
> [27956.998808] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state change 3899801398 (30672ms) rv = -23
> [27956.998973] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: conn( Unconnected -> Connecting )
> 
> Finally, the state changes:
> 
> [27957.686777] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 2682176118 (0->2 499/146)
> [27957.710748] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002: discard_zeroes_data=0 and discard_zeroes_if_aligned=no: disabling discards
> [27957.710774] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: drbd_sync_handshake:
> [27957.710775] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: self 259E390F9A174E68:0000000000000000:78D7B860591E8E4E:0000000000000000 bits:0 flags:120
> [27957.710777] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: peer 259E390F9A174E68:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:120
> [27957.710778] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: uuid_compare()=no-sync by rule=reconnected
> [27957.711207] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: State change 2682176118: primary_nodes=0, weak_nodes=0
> [27957.711208] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Committing cluster-wide state change 2682176118 (24ms)
> 
> The state change finally succeeds after nearly 20 minutes. Do you know what is
> going on, or how I could debug this issue? I can provide more information if
> required.
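> 
> To make the pattern concrete, below is a toy model (all names and outcomes
> invented for illustration; this is not the actual DRBD kernel code) of how I
> read the two-phase sequence in the logs: each "Preparing" either aborts with
> SS_TIMEOUT or SS_CONCURRENT_ST_CHG and is retried, until one attempt reaches
> "Committing":
> 
>     /* twopc_model.c -- toy model of the prepare/abort/commit pattern
>      * seen in kern.log; names and outcomes are made up for illustration. */
>     #include <stdio.h>
> 
>     #define SS_SUCCESS             1
>     #define SS_CONCURRENT_ST_CHG -19
>     #define SS_TIMEOUT           -23
> 
>     /* Outcomes mirroring the excerpts above: repeated timeouts, one
>      * concurrent-change abort, then a successful commit. */
>     static const int outcomes[] = { SS_TIMEOUT, SS_TIMEOUT,
>                                     SS_CONCURRENT_ST_CHG, SS_TIMEOUT,
>                                     SS_SUCCESS };
> 
>     int main(void)
>     {
>         for (unsigned i = 0; i < sizeof(outcomes) / sizeof(outcomes[0]); i++) {
>             printf("Preparing cluster-wide state change (attempt %u)\n", i + 1);
>             if (outcomes[i] != SS_SUCCESS) {
>                 printf("  Aborting, rv = %d; retrying\n", outcomes[i]);
>                 continue;
>             }
>             printf("  Committing cluster-wide state change\n");
>             break;
>         }
>         return 0;
>     }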
> 
> Thanks, Matias.

