Blocks fail to sync if connection lost during initial sync

Tim Westbrook Tim_Westbrook at selinc.com
Fri Mar 8 00:55:49 CET 2024


Hi 

I am running into an issue with disconnect during initial sync not resuming sync after reconnecting

While using drbd 9.2.7, primary and two secondaries, the primary has existing data that I need to preserve

The mesh is such that all nodes can reach all other nodes

1. After adding the new nodes and performing an adjust on the primary
2. The remote nodes start to initially sync (this usually takes 12 minutes in this test scenario (32g disks))
3. After 5 minutes I disable the network interface on the primary (the secondaries can still see each other)
4. Wait 20 minutes or longer - length doesn't seem to matter as long as is longer than normal sync time
5. Re-enable, instantly backup nodes are both up to date.
6. Run data integrity check, it fails marking a lot of data as out of sync

This is very reproducible

I have attached my configuration file[1] and the kernel logs snippets from primary[2] and a backup [3]

Here is the result of verify

[ 3960.135562] drbd persist/0 drbd0 node-40274784292070177166: Online verify done (total 379 sec; paused 0 sec; 72328 K/sec)
[ 3960.135570] drbd persist/0 drbd0 node-40274784292070177166: Online verify found 203568 4k blocks out of sync!

Is DRBD resilient to disconnects during initial sync? Should we not sync two secondary nodes at once?


I see there are some sync fixes in the RC for 9.2.8  , But not clear these may be related to rdma, I'm using tcp


Cheers, 

Tim

// Attachments 
[1] config


global {
        usage-count no;
        disable-ip-verification;
}

resource persist {
  device  /dev/drbd0;
  disk    /dev/strup-vg/strup-persist;

  disk {
    resync-rate 400M;
  }

  net {
    after-sb-1pri discard-secondary;
    verify-alg md5;
  }

  meta-disk /dev/strup-vg/strup-metapersist;

  on node-6927108042395771210 {
    node-id 1;
    address 172.20.0.1:7789;
  }

  on node-40274784292070177166 {
    node-id 2;
    address 172.20.0.2:7789;
  }

  on node-1010560022655352503 {
    node-id 3;
    address 172.20.0.3:7789;
  }

  connection-mesh {
    hosts node-6927108042395771210 node-40274784292070177166 node-1010560022655352503;
  }
}

[2] Primary Log on disconnect/reconnect

[  450.949393] drbd persist node-40274784292070177166: PingAck did not arrive in time.
[  450.949413] drbd persist node-40274784292070177166: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[  450.949419] drbd persist/0 drbd0 node-40274784292070177166: repl( SyncSource -> Off )
[  450.949440] drbd persist node-1010560022655352503: PingAck did not arrive in time.
[  450.949457] drbd persist node-1010560022655352503: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[  450.949462] drbd persist/0 drbd0 node-1010560022655352503: repl( SyncSource -> Off )
[  450.949756] drbd persist node-40274784292070177166: Terminating sender thread
[  450.949792] drbd persist node-1010560022655352503: Terminating sender thread
[  450.949826] drbd persist node-1010560022655352503: Starting sender thread (from drbd_r_persist [8657])
[  450.949852] drbd persist node-40274784292070177166: Starting sender thread (from drbd_r_persist [8581])
[  450.956448] drbd persist node-1010560022655352503: meta connection shut down by peer.
[  450.956514] drbd persist node-40274784292070177166: meta connection shut down by peer.
[  450.992921] drbd persist node-1010560022655352503: Connection closed
[  450.992942] drbd persist node-1010560022655352503: helper command: /sbin/drbdadm disconnected
[  450.993325] drbd persist node-40274784292070177166: Connection closed
[  450.993355] drbd persist node-40274784292070177166: helper command: /sbin/drbdadm disconnected
[  450.995126] drbd persist node-1010560022655352503: helper command: /sbin/drbdadm disconnected exit code 255 (0xfffffffe)
[  450.995183] drbd persist node-1010560022655352503: conn( NetworkFailure -> Unconnected ) [disconnected]
[  450.995202] drbd persist node-1010560022655352503: Restarting receiver thread
[  450.995214] drbd persist node-1010560022655352503: conn( Unconnected -> Connecting ) [connecting]
[  450.995255] drbd persist node-1010560022655352503: Configured local address not found, retrying every 10 sec, err=-99
[  450.995771] drbd persist node-40274784292070177166: helper command: /sbin/drbdadm disconnected exit code 255 (0xfffffffe)
[  450.995818] drbd persist node-40274784292070177166: conn( NetworkFailure -> Unconnected ) [disconnected]
[  450.995875] drbd persist node-40274784292070177166: Restarting receiver thread
[  450.995886] drbd persist node-40274784292070177166: conn( Unconnected -> Connecting ) [connecting]
[  450.995949] drbd persist node-40274784292070177166: Configured local address not found, retrying every 10 sec, err=-99
[ 2960.265825] drbd persist node-40274784292070177166: Handshake to peer 2 successful: Agreed network protocol version 122
[ 2960.265831] drbd persist node-40274784292070177166: Feature flags enabled on protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
[ 2960.266566] drbd persist: Preparing cluster-wide state change 2999333550 (1->2 499/145)
[ 2960.269601] drbd persist node-1010560022655352503: Handshake to peer 3 successful: Agreed network protocol version 122
[ 2960.269604] drbd persist node-1010560022655352503: Feature flags enabled on protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
[ 2960.275585] drbd persist/0 drbd0 node-40274784292070177166: drbd_sync_handshake:
[ 2960.275589] drbd persist/0 drbd0 node-40274784292070177166: self BFF38DFE83BC7649:0000000000000004:0000000000000000:0000000000000000 bits:1576288 flags:120
[ 2960.275593] drbd persist/0 drbd0 node-40274784292070177166: peer BEFE83A748FAC586:0000000000000000:D31B9930119D4DEC:BEFE83A748FAC586 bits:1554102 flags:1124
[ 2960.275597] drbd persist/0 drbd0 node-40274784292070177166: uuid_compare()=source-copy-other-bitmap by rule=bitmap-self-other
[ 2960.275599] drbd persist/0 drbd0 node-40274784292070177166: Copying bitmap of peer node_id=0 (bitmap_index=2)
[ 2960.300345] drbd persist: State change 2999333550: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFF9
[ 2960.300353] drbd persist: Committing cluster-wide state change 2999333550 (33ms)
[ 2960.300389] drbd persist node-40274784292070177166: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) [connected]
[ 2960.300394] drbd persist/0 drbd0 node-40274784292070177166: repl( Off -> WFBitMapS ) [connected]
[ 2960.303497] drbd persist/0 drbd0 node-40274784292070177166: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 1046(1), total 1046; compression: 99.9%
[ 2960.306045] drbd persist/0 drbd0 node-40274784292070177166: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 1046(1), total 1046; compression: 99.9%
[ 2960.306058] drbd persist/0 drbd0 node-40274784292070177166: helper command: /sbin/drbdadm before-resync-source
[ 2960.306257] drbd persist/0 drbd0 node-40274784292070177166: helper command: /sbin/drbdadm before-resync-source exit code 255 (0xfffffffe)
[ 2960.306279] drbd persist/0 drbd0 node-40274784292070177166: repl( WFBitMapS -> SyncSource ) [receive-bitmap]
[ 2960.306349] drbd persist/0 drbd0 node-40274784292070177166: Began resync as SyncSource (will sync 244744 KB [61186 bits set]).
[ 2960.321526] drbd persist: Preparing cluster-wide state change 1704893824 (1->3 499/145)
[ 2960.330564] drbd persist/0 drbd0 node-1010560022655352503: drbd_sync_handshake:
[ 2960.330569] drbd persist/0 drbd0 node-1010560022655352503: self BFF38DFE83BC7649:0000000000000004:D31B9930119D4DEC:0000000000000000 bits:1912416 flags:120
[ 2960.330574] drbd persist/0 drbd0 node-1010560022655352503: peer 0000000000000004:0000000000000000:BEFE83A748FAC586:0000000000000000 bits:1890230 flags:1124
[ 2960.330579] drbd persist/0 drbd0 node-1010560022655352503: uuid_compare()=source-set-bitmap by rule=just-created-peer
[ 2960.330582] drbd persist/0 drbd0 node-1010560022655352503: Setting and writing one bitmap slot, after drbd_sync_handshake
[ 2960.331946] drbd persist/0 drbd0: FIXME kworker/1:0[21] op clear, bitmap locked for 'set_n_write sync_handshake' by drbd_r_persist[8657]
[ 2960.334034] drbd persist/0 drbd0: bitmap page idx 52 changed during IO!
[ 2960.334040] drbd persist/0 drbd0: bitmap page idx 53 changed during IO!
[ 2960.334043] drbd persist/0 drbd0: bitmap page idx 55 changed during IO!
[ 2960.335522] drbd persist/0 drbd0 node-1010560022655352503: updated bitmap UUID BFF38DFE83BC7649:0000000000000004:D31B9930119D4DEC:0000000000000000
[ 2960.337827] drbd persist: State change 1704893824: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFF1
[ 2960.337834] drbd persist: Committing cluster-wide state change 1704893824 (16ms)
[ 2960.341188] drbd persist node-1010560022655352503: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) [connected]
[ 2960.341192] drbd persist/0 drbd0 node-1010560022655352503: repl( Off -> WFBitMapS ) [connected]
[ 2960.343712] drbd persist/0 drbd0 node-1010560022655352503: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[ 2960.351952] drbd persist/0 drbd0 node-1010560022655352503: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[ 2960.351968] drbd persist/0 drbd0 node-1010560022655352503: helper command: /sbin/drbdadm before-resync-source
[ 2960.353490] drbd persist/0 drbd0 node-1010560022655352503: helper command: /sbin/drbdadm before-resync-source exit code 255 (0xfffffffe)
[ 2960.353558] drbd persist/0 drbd0 node-1010560022655352503: repl( WFBitMapS -> SyncSource ) [receive-bitmap]
[ 2960.353687] drbd persist/0 drbd0 node-1010560022655352503: Began resync as SyncSource (will sync 26599424 KB [6649856 bits set]).
[ 2963.145539] drbd persist/0 drbd0 node-40274784292070177166: updated UUIDs BFF38DFE83BC7649:0000000000000000:D31B9930119D4DEC:0000000000000000
[ 2963.150900] drbd persist/0 drbd0 node-40274784292070177166: Resync done (total 2 sec; paused 0 sec; 122372 K/sec)
[ 2963.150913] drbd persist/0 drbd0 node-40274784292070177166: pdsk( Inconsistent -> UpToDate ) repl( SyncSource -> Established ) [resync-finished]
[ 3004.217498] drbd persist/0 drbd0 node-1010560022655352503: updated UUIDs BFF38DFE83BC7649:0000000000000000:D31B9930119D4DEC:0000000000000000
[ 3004.228794] drbd persist/0 drbd0 node-1010560022655352503: Resync done (total 43 sec; paused 0 sec; 618588 K/sec)
[ 3004.228806] drbd persist/0 drbd0 node-1010560022655352503: pdsk( Inconsistent -> UpToDate ) repl( SyncSource -> Established ) [resync-finished]

[3] Backup log on disconnect/reconnect
[  295.174846] drbd persist node-6927108042395771210: Connection closed
[  295.174863] drbd persist node-6927108042395771210: helper command: /sbin/drbdadm disconnected
[  295.175088] drbd persist node-6927108042395771210: helper command: /sbin/drbdadm disconnected exit code 255 (0xfffffffe)
[  295.175113] drbd persist node-6927108042395771210: conn( NetworkFailure -> Unconnected ) [disconnected]
[  295.175131] drbd persist node-6927108042395771210: Restarting receiver thread
[  295.175142] drbd persist node-6927108042395771210: conn( Unconnected -> Connecting ) [connecting]
[ 2804.185311] drbd persist node-40274784292070177166: Preparing remote state change 2999333550
[ 2804.188050] drbd persist node-6927108042395771210: Handshake to peer 1 successful: Agreed network protocol version 122
[ 2804.188059] drbd persist node-6927108042395771210: Feature flags enabled on protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
[ 2804.219640] drbd persist node-40274784292070177166: Committing remote state change 2999333550 (primary_nodes=2)
[ 2804.224622] drbd persist/0 drbd0 node-40274784292070177166: resync-susp( no -> peer ) [peer-state]
[ 2804.243564] drbd persist node-6927108042395771210: Preparing remote state change 1704893824
[ 2804.248755] drbd persist/0 drbd0 node-6927108042395771210: drbd_sync_handshake:
[ 2804.248761] drbd persist/0 drbd0 node-6927108042395771210: self 0000000000000004:0000000000000000:BEFE83A748FAC586:0000000000000000 bits:1890230 flags:124
[ 2804.248769] drbd persist/0 drbd0 node-6927108042395771210: peer BFF38DFE83BC7649:0000000000000004:0000000000000000:0000000000000000 bits:1912416 flags:1120
[ 2804.248777] drbd persist/0 drbd0 node-6927108042395771210: uuid_compare()=target-set-bitmap by rule=just-created-self
[ 2804.248783] drbd persist/0 drbd0 node-6927108042395771210: Setting and writing the whole bitmap, fresh node
[ 2804.259577] drbd persist node-6927108042395771210: Committing remote state change 1704893824 (primary_nodes=2)
[ 2804.259598] drbd persist node-6927108042395771210: conn( Connecting -> Connected ) peer( Unknown -> Primary ) [remote]
[ 2804.259607] drbd persist/0 drbd0 node-6927108042395771210: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT ) [remote]
[ 2804.263408] drbd persist/0 drbd0 node-6927108042395771210: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[ 2804.268930] drbd persist/0 drbd0 node-6927108042395771210: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[ 2804.268954] drbd persist/0 drbd0 node-6927108042395771210: helper command: /sbin/drbdadm before-resync-target
[ 2804.269181] drbd persist/0 drbd0 node-6927108042395771210: helper command: /sbin/drbdadm before-resync-target exit code 255 (0xfffffffe)
[ 2804.269194] drbd persist/0 drbd0 node-6927108042395771210: setting UUIDs to 0000000000000004:0000000000000000:BEFE83A748FAC586:0000000000000000
[ 2804.269211] drbd persist/0 drbd0 node-6927108042395771210: repl( WFBitMapT -> SyncTarget ) [receive-bitmap]
[ 2804.269272] drbd persist/0 drbd0 node-6927108042395771210: Began resync as SyncTarget (will sync 26599424 KB [6649856 bits set]).
[ 2807.061425] drbd persist/0 drbd0 node-40274784292070177166: pdsk( Inconsistent -> UpToDate ) [peer-state]
[ 2807.061442] drbd persist/0 drbd0: Multiple UpToDate peers have different current UUIDs
[ 2807.061448] drbd persist/0 drbd0: node-40274784292070177166: BEFE83A748FAC586 node-6927108042395771210: BFF38DFE83BC7649
[ 2807.061453] drbd persist/0 drbd0: Setting exposed data uuid: BEFE83A748FAC586
[ 2807.061535] drbd persist/0 drbd0: Multiple UpToDate peers have different current UUIDs
[ 2807.061541] drbd persist/0 drbd0: node-40274784292070177166: BEFE83A748FAC586 node-6927108042395771210: BFF38DFE83BC7649
[ 2807.064043] drbd persist/0 drbd0 node-40274784292070177166: updated UUIDs 0000000000000004:0000000000000000:BEFE83A748FAC586:0000000000000000
[ 2807.069167] drbd persist/0 drbd0: Setting exposed data uuid: BFF38DFE83BC7648
[ 2848.135754] drbd persist/0 drbd0 node-6927108042395771210: Resync done (total 43 sec; paused 0 sec; 618588 K/sec)
[ 2848.135776] drbd persist/0 drbd0 node-6927108042395771210: updated UUIDs BFF38DFE83BC7648:0000000000000000:BEFE83A748FAC586:0000000000000000
[ 2848.135801] drbd persist/0 drbd0: disk( Inconsistent -> UpToDate ) [resync-finished]
[ 2848.135809] drbd persist/0 drbd0 node-40274784292070177166: resync-susp( connection dependency -> no ) [resync-finished]
[ 2848.135818] drbd persist/0 drbd0 node-6927108042395771210: repl( SyncTarget -> Established ) [resync-finished]
[ 2848.136253] drbd persist/0 drbd0 node-6927108042395771210: helper command: /sbin/drbdadm after-resync-target
[ 2848.136483] drbd persist/0 drbd0 node-6927108042395771210: helper command: /sbin/drbdadm after-resync-target exit code 255 (0xfffffffe)




More information about the drbd-user mailing list