[DRBD-user] Hang during cluster-wide state change

Matias Ezequiel Vara Larsen matiasevara at gmail.com
Tue Jun 7 16:31:49 CEST 2022


Hello,

I am observing the following behavior during the cluster-wide state change
(kern.log):

[27050.375909] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing
cluster-wide state change 2579620917 (0->-1 3/1) 
[27080.486801] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state
change 2579620917 (30112ms) rv = -23
[27081.671615] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing
cluster-wide state change 3642399026 (0->-1 3/1)
[27113.253606] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state
change 3642399026 (31584ms) rv = -23

Here rv=-13 indicates SS_TIMEOUT. This continues for a while. Then, error
changes to rv=-19 that indicates SS_CONCURRENT_ST_CH:

[27914.141426] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing
cluster-wide state change 3371883425 (0->-1 3/1) 
[27926.280093] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: sock_recvmsg returned
-113 
[27926.280099] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3:
conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown ) 
[27926.280101] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: pdsk(
UpToDate -> DUnknown ) repl( Established -> Off ) 
[27926.280167] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: ack_receiver terminated
[27926.280168] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3:
Terminating ack_recv thread 
[27926.280189] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state
change 3371883425 (12140ms) rv = -19 
[27926.280252] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: State change failed: In
transient state, retry after next state change

State change continues failing but with -23:

[27926.328265] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing
cluster-wide state change 3899801398 (0->-1 3/1) 
[27926.329319] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: Connection closed
[27926.329326] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3:
helper command: /sbin/drbdadm disconnected 
[27926.332926] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: helper command:
/sbin/drbdadm disconnected exit code 0 
[27926.332955] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: conn( BrokenPipe ->
Unconnected ) 
[27926.332980] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: Restarting receiver
thread 
[27956.998808] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e:
Aborting cluster-wide state change 3899801398 (30672ms) rv = -23 
[27956.998973]
drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: conn( Unconnected
-> Connecting )

Finally, the state changes:

[27957.686777] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing
cluster-wide state change 2682176118 (0->2 499/146) 
[27957.710748] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002:
discard_zeroes_data=0 and discard_zeroes_if_aligned=no: disabling discards
[27957.710774] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002
xosan3: drbd_sync_handshake: 
[27957.710775] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: self
259E390F9A174E68:0000000000000000:78D7B860591E8E4E:0000000000000000 bits:0
flags:120 
[27957.710777] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0
drbd1002 xosan3: peer
259E390F9A174E68:0000000000000000:0000000000000000:0000000000000000 bits:0
flags:120 
[27957.710778] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0
drbd1002 xosan3: uuid_compare()=no-sync by rule=reconnected 
[27957.711207] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: State change 2682176118:
primary_nodes=0, weak_nodes=0 
[27957.711208] drbd
xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Committing cluster-wide state
change 2682176118 (24ms)

The state successes to change after nearly 20 minutes. Do you know what is going
on or how could I debug this issue? I could provide more information if
required.

Thanks, Matias.


More information about the drbd-user mailing list