[Drbd-dev] Problems with code to disallow connection when peer has
mismatched UUID
Graham, Simon
Simon.Graham at stratus.com
Wed Sep 10 22:13:43 CEST 2008
A change made to DRBD 8.2 on 3/25/08 (commit e817888f) to forcibly
disconnect in receive_uuids in some cases seems to be causing us
problems in some scenarios - the specific sequence is this (see detailed
trace extracts at the end of this)
. We start with node1 secondary and node0 primary and then disconnect
and reconnect
. This causes us to enter PausedSyncT on node1 and PausedSyncS on node0
(because higher priority devices
are syncing)
. Now we swap primaryness so node1 is Primary/PausedSyncT/Inconsistent
and node0 is Secondary/PausedSyncS/UpToDate
. Lose connection again - because Node1 is primary, we create a new
current UUID.
. Now we connect again - the code in receive_uuids on Node1 refuses to
connect because it's Primary/Inconsistent and
the other node is using a different UUID.
. We are now stuck in a loop trying to connect and dropping the
connection continually - no way to fix automatically either
since no event is generated that a user mode program could use to
trigger recovery.
I'm not sure what the right answer is here, but I have a couple of
observations:
1. When we lose connection and are Primary/Inconsistent, should we
really create a new current UUID? The local data is bad.
2. When trying to connect and we are Primary/Inconsistent, shouldn't we
accept the connection even if the UUIDs are different?
Surely the uuid handshake will calculate the correct way to
synchronize in this case? It certainly would in the above
case (and you can see this in the trace from node1)
Here's the code in question:
if (mdev->state.conn < Connected &&
mdev->state.disk < Outdated &&
mdev->state.role == Primary &&
(mdev->ed_uuid & ~((u64)1)) != (p_uuid[Current] & ~((u64)1))) {
ERR("Can only connect to data with current UUID=%016llX\n",
(unsigned long long)mdev->ed_uuid);
drbd_force_state(mdev,NS(conn,Disconnecting));
return FALSE;
}
Any thoughts?
Simon
Trace from node1:
6719:Aug 18 22:45:05 node1 kernel: drbd38: conn( WFConnection ->
WFReportParams )
6722:Aug 18 22:45:05 node1 kernel: drbd38: drbd_sync_handshake:
6723:Aug 18 22:45:05 node1 kernel: drbd38: self
CADAA290A29A95F0:0000000000000000:4CC7FE9AE8243C78:0000000000000006
6724:Aug 18 22:45:05 node1 kernel: drbd38: peer
678FDAD7A9A6D636:CADAA290A29A95F0:8ACAF333A2BE196A:4CC7FE9AE8243C78
6725:Aug 18 22:45:05 node1 kernel: drbd38: uuid_compare()=-1 by rule
5
6726:Aug 18 22:45:05 node1 kernel: drbd38: Becoming sync target due
to disk states.
6727:Aug 18 22:45:05 node1 kernel: drbd38: peer( Unknown -> Secondary
) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
peer_isp( 0 -> 1 )
6729:Aug 18 22:45:05 node1 kernel: drbd38: conn( WFBitMapT ->
WFSyncUUID )
6730:Aug 18 22:45:05 node1 kernel: drbd38: exposed data uuid now
3633090B73366ED2
6731:Aug 18 22:45:05 node1 kernel: drbd38: uuid[Current] now
3633090B73366ED2
6732:Aug 18 22:45:05 node1 kernel: drbd38: uuid[Bitmap] now
0000000000000000
6734:Aug 18 22:45:05 node1 kernel: drbd38: conn( WFSyncUUID ->
PausedSyncT )
6735:Aug 18 22:45:05 node1 kernel: drbd38: Began resync as
PausedSyncT (will sync 118668 KB [29667 bits set]).
7027:Aug 18 22:46:07 node1 kernel: drbd38: role( Secondary -> Primary
)
7437:Aug 18 22:46:39 node1 kernel: drbd38: meta connection shut down
by peer.
7438:Aug 18 22:46:39 node1 kernel: drbd38: peer( Secondary -> Unknown
) conn( PausedSyncT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
peer_isp( 1 -> 0 )
7456:Aug 18 22:46:39 node1 kernel: drbd38: sock was shut down by peer
7460:Aug 18 22:46:39 node1 kernel: drbd38: Creating new current UUID
7461:Aug 18 22:46:39 node1 kernel: drbd38: uuid[Bitmap] now
3633090B73366ED3
7462:Aug 18 22:46:39 node1 kernel: drbd38: exposed data uuid now
A398D35216141B4D
7463:Aug 18 22:46:39 node1 kernel: drbd38: uuid[Current] now
A398D35216141B4D
7464:Aug 18 22:46:39 node1 kernel: drbd38: Writing meta data super
block now.
7486:Aug 18 22:46:39 node1 kernel: drbd38: Connection closed
7487:Aug 18 22:46:39 node1 kernel: drbd38: Considering state change
from bad state. Error would be: 'Refusing to be Primary without at least
one UpToDate disk'
7488:Aug 18 22:46:39 node1 kernel: drbd38: old = { cs:NetworkFailure
st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
7489:Aug 18 22:46:39 node1 kernel: drbd38: new = { cs:Unconnected
st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
7490:Aug 18 22:46:39 node1 kernel: drbd38: conn( NetworkFailure ->
Unconnected )
7491:Aug 18 22:46:39 node1 kernel: drbd38: receiver terminated
7492:Aug 18 22:46:39 node1 kernel: drbd38: receiver (re)started
7493:Aug 18 22:46:39 node1 kernel: drbd38: Considering state change
from bad state. Error would be: 'Refusing to be Primary without at least
one UpToDate disk'
7494:Aug 18 22:46:39 node1 kernel: drbd38: old = { cs:Unconnected
st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
7495:Aug 18 22:46:39 node1 kernel: drbd38: new = { cs:WFConnection
st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
7496:Aug 18 22:46:39 node1 kernel: drbd38: conn( Unconnected ->
WFConnection )
7560:Aug 18 22:46:40 node1 kernel: drbd38: Handshake successful:
Agreed network protocol version 88
7561:Aug 18 22:46:40 node1 kernel: drbd38: Considering state change
from bad state. Error would be: 'Refusing to be Primary without at least
one UpToDate disk'
7562:Aug 18 22:46:40 node1 kernel: drbd38: old = { cs:WFConnection
st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
7563:Aug 18 22:46:40 node1 kernel: drbd38: new = { cs:WFReportParams
st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
7564:Aug 18 22:46:40 node1 kernel: drbd38: conn( WFConnection ->
WFReportParams )
7580:Aug 18 22:46:40 node1 kernel: drbd38: Can only connect to data
with current UUID=A398D35216141B4D
7581:Aug 18 22:46:40 node1 kernel: drbd38: conn( WFReportParams ->
Disconnecting )
7586:Aug 18 22:46:40 node1 kernel: drbd38: Connection closed
Trace from Node0:
5185:Aug 18 22:45:06 node0 kernel: drbd38: conn( StandAlone ->
Unconnected )
5188:Aug 18 22:45:06 node0 kernel: drbd38: conn( Unconnected ->
WFConnection )
5197:Aug 18 22:45:06 node0 kernel: drbd38: aftr_isp( 0 -> 1 )
5248:Aug 18 22:45:06 node0 kernel: drbd38: Handshake successful:
Agreed network protocol version 88
5249:Aug 18 22:45:06 node0 kernel: drbd38: conn( WFConnection ->
WFReportParams )
5250:Aug 18 22:45:06 node0 kernel: drbd38: Starting asender thread
(from drbd38_receiver [16796])
5251:Aug 18 22:45:06 node0 kernel: drbd38: data-integrity-alg:
<not-used>
5252:Aug 18 22:45:06 node0 kernel: drbd38: drbd_sync_handshake:
5253:Aug 18 22:45:06 node0 kernel: drbd38: self
678FDAD7A9A6D636:CADAA290A29A95F0:8ACAF333A2BE196A:4CC7FE9AE8243C78
5254:Aug 18 22:45:06 node0 kernel: drbd38: peer
CADAA290A29A95F0:0000000000000000:4CC7FE9AE8243C78:0000000000000006
5255:Aug 18 22:45:06 node0 kernel: drbd38: uuid_compare()=1 by rule 7
5256:Aug 18 22:45:06 node0 kernel: drbd38: Becoming sync source due
to disk states.
5257:Aug 18 22:45:06 node0 kernel: drbd38: peer( Unknown -> Secondary
) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
peer_isp( 0 -> 1 )
5258:Aug 18 22:45:06 node0 kernel: drbd38: Writing meta data super
block now.
5263:Aug 18 22:45:06 node0 kernel: drbd38: uuid[History_start] now
CADAA290A29A95F0
5264:Aug 18 22:45:06 node0 kernel: drbd38: uuid[Bitmap] now
3633090B73366ED2
5265:Aug 18 22:45:06 node0 kernel: drbd38: conn( WFBitMapS ->
PausedSyncS )
5266:Aug 18 22:45:06 node0 kernel: drbd38: Began resync as
PausedSyncS (will sync 118668 KB [29667 bits set]).
5267:Aug 18 22:45:06 node0 kernel: drbd38: Writing meta data super
block now.
5410:Aug 18 22:46:07 node0 kernel: drbd38: peer( Secondary -> Primary
)
5427:Aug 18 22:46:39 node0 kernel: drbd38: PingAck did not arrive in
time.
5428:Aug 18 22:46:39 node0 kernel: drbd38: peer( Primary -> Unknown )
conn( PausedSyncS -> NetworkFailure ) peer_isp( 1 -> 0 )
5434:Aug 18 22:46:39 node0 kernel: drbd38: Connection closed
5435:Aug 18 22:46:39 node0 kernel: drbd38: conn( NetworkFailure ->
Unconnected )
5438:Aug 18 22:46:39 node0 kernel: drbd38: conn( Unconnected ->
WFConnection )
5546:Aug 18 22:46:40 node0 kernel: drbd38: Handshake successful:
Agreed network protocol version 88
5547:Aug 18 22:46:40 node0 kernel: drbd38: conn( WFConnection ->
WFReportParams )
5561:Aug 18 22:46:40 node0 kernel: drbd38: meta connection shut down
by peer.
5562:Aug 18 22:46:40 node0 kernel: drbd38: conn( WFReportParams ->
NetworkFailure )
5573:Aug 18 22:46:40 node0 kernel: drbd38: Connection closed
5574:Aug 18 22:46:40 node0 kernel: drbd38: conn( NetworkFailure ->
Unconnected )
5577:Aug 18 22:46:40 node0 kernel: drbd38: conn( Unconnected ->
WFConnection )
5598:Aug 18 22:46:43 node0 kernel: drbd38: Handshake successful:
Agreed network protocol version 88
5599:Aug 18 22:46:43 node0 kernel: drbd38: conn( WFConnection ->
WFReportParams )
5600:Aug 18 22:46:43 node0 kernel: drbd38: Starting asender thread
(from drbd38_receiver [16796])
5602:Aug 18 22:46:43 node0 kernel: drbd38: drbd_sync_handshake:
5603:Aug 18 22:46:43 node0 kernel: drbd38: self
678FDAD7A9A6D636:3633090B73366ED2:CADAA290A29A95F0:8ACAF333A2BE196A
5604:Aug 18 22:46:43 node0 kernel: drbd38: peer
A398D35216141B4D:3633090B73366ED3:4CC7FE9AE8243C78:0000000000000006
5605:Aug 18 22:46:43 node0 kernel: drbd38: uuid_compare()=100 by rule
9
5606:Aug 18 22:46:43 node0 kernel: drbd38: Becoming sync source due
to disk states.
5607:Aug 18 22:46:43 node0 kernel: drbd38: peer( Unknown -> Primary )
conn( WFReportParams -> WFBitMapS ) peer_isp( 0 -> 1 )
5609:Aug 18 22:46:43 node0 kernel: drbd38: meta connection shut down
by peer.
5610:Aug 18 22:46:43 node0 kernel: drbd38: peer( Primary -> Unknown )
conn( WFBitMapS -> NetworkFailure ) peer_isp( 1 -> 0 )
5611:Aug 18 22:46:43 node0 kernel: drbd38: asender terminated
5612:Aug 18 22:46:43 node0 kernel: drbd38: Terminating asender thread
More information about the drbd-dev
mailing list