[DRBD-user] DRBD 9.0 - Resource stuck in "Connected" - "UpToDate/Inconsistent" state

Rossa, Alexander Alexander.Rossa at ncr.com
Mon Apr 25 19:01:30 CEST 2022


Hello all,

I have a question regarding an odd DRBD state that our resources got into shortly after bootstrap. Despite the resource claiming to be "Connected", it is actually out of sync: the disk state shows as "UpToDate/Inconsistent", but the resync never happens.

These are Linux servers running a Xen hypervisor, where we use DRBD 9.0.30 to sync VM disks across hosts.

We've seen this issue appear twice in the past few weeks across maybe a hundred or so deployments, so it seems relatively rare, but we'd like to get to the bottom of it.

This is the post-bootstrap state (rendered in the format of the now-deprecated drbd-overview; all data is retrieved from drbdsetup status --json):

Side 7c:
1005:resource-5/0 Connected    Primary/Secondary   UpToDate/Inconsistent
Side 7d:
1005:resource-5/0 Connected    Secondary/Primary   Inconsistent/UpToDate
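
Incidentally, this stuck state can be flagged straight from the JSON. A minimal jq sketch -- the field names ("connection-state", "peer_devices", "peer-disk-state", "replication-state") are as I recall them from our 9.0.30 output, so treat them as assumptions and adjust to your actual drbdsetup output:

    # Flag peers that are Connected but Inconsistent while replication still
    # sits in Established (i.e. no SyncSource/SyncTarget, so no resync running).
    drbdsetup status --json | jq -r '
      .[] | .name as $res
      | .connections[]
      | select(."connection-state" == "Connected")
      | .peer_devices[]
      | select(."peer-disk-state" == "Inconsistent"
               and ."replication-state" == "Established")
      | "\($res): Connected but peer Inconsistent, no resync running"'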

A drbdadm disconnect followed by a drbdadm connect resolves the issue and the resource starts syncing.
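
For reference, the workaround is nothing more than the following, run against the affected resource on one of the nodes:

    # tear the connection down and bring it back up; the reconnect handshake
    # then triggers the resync that should have started in the first place
    drbdadm disconnect resource-5
    drbdadm connect resource-5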

Side 7c:
1005:resource-5/0 SyncSource   Primary/Secondary   UpToDate/Inconsistent
    [>.................................................] sync'ed: 0.4%
Side 7d:
1005:resource-5/0 SyncTarget   Secondary/Primary   Inconsistent/UpToDate
    [>.................................................] sync'ed: 0.4%

Looking at the kern.log entries, I can see that the issue likely happened during the handshake, when the resource was about to begin its resync. For some reason there are two handshake attempt entries for `resource-5`, one right after the other (there is only one for any other resource):

Side 7c:
Apr 22 23:22:23 host-7c: [  801.305463] drbd resource-5/0 drbd1005 host-7d: drbd_sync_handshake:
Apr 22 23:22:23 host-7c: [  801.305466] drbd resource-5/0 drbd1005 host-7d: self E8D4DA04CB03E845:AC3365A2A7A0372F:0000000000000000:0000000000000000 bits:0 flags:20
Apr 22 23:22:23 host-7c: [  801.305468] drbd resource-5/0 drbd1005 host-7d: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:4
Apr 22 23:22:23 host-7c: [  801.305469] drbd resource-5/0 drbd1005 host-7d: uuid_compare()=source-set-bitmap by rule=just-created-peer
Apr 22 23:22:23 host-7c: [  801.305471] drbd resource-5/0 drbd1005 host-7d: Setting and writing one bitmap slot, after drbd_sync_handshake
Apr 22 23:22:23 host-7c: [  801.309659] drbd resource-5/0 drbd1005 host-7d: drbd_sync_handshake:
Apr 22 23:22:23 host-7c: [  801.309663] drbd resource-5/0 drbd1005 host-7d: self E8D4DA04CB03E845:AC3365A2A7A0372F:0000000000000000:0000000000000000 bits:0 flags:20
Apr 22 23:22:23 host-7c: [  801.309664] drbd resource-5/0 drbd1005 host-7d: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:4
Apr 22 23:22:23 host-7c: [  801.309666] drbd resource-5/0 drbd1005 host-7d: uuid_compare()=source-set-bitmap by rule=just-created-peer
Apr 22 23:22:23 host-7c: [  801.309667] drbd resource-5/0 drbd1005 host-7d: Setting and writing one bitmap slot, after drbd_sync_handshake
Apr 22 23:22:23 host-7c: [  801.310865] drbd resource-5/0 drbd1005 host-7d: repl( Established -> WFBitMapS ) resync-susp( peer -> no )
Apr 22 23:22:23 host-7c: [  801.311814] drbd resource-5/0 drbd1005 host-7d: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
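
For what it's worth, the doubled handshake is easy to spot across all resources with a rough one-liner (assuming the kern.log path and message format shown above):

    # count handshake log lines per resource/volume; the duplicate shows up
    # as an extra "drbd_sync_handshake:" entry for resource-5/0
    grep 'drbd_sync_handshake:' /var/log/kern.log \
      | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^resource-/) print $i }' \
      | sort | uniq -c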

On the other side there is very little in response, only a notification that the repl_state it saw wasn't expected:

Side 7d:
Apr 22 23:22:23 host-7d: [  954.942328] drbd resource-5/0 drbd1005 host-7c: resync-susp( peer -> no )
Apr 22 23:22:23 host-7d: [  954.947226] drbd resource-5/0 drbd1005 host-7c: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Apr 22 23:22:23 host-7d: [  954.947227] drbd resource-5/0 drbd1005 host-7c: unexpected repl_state (Established) in receive_bitmap

Inspecting the resource confirmed that it indeed had data `out-of-sync` (as also evidenced by the resync on reconnect). My reading of the logs is that side 7c had already moved on to WFBitMapS and sent its bitmap while side 7d still considered the replication state Established, so the bitmap was presumably dropped and side 7d never transitioned to SyncTarget.
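
For completeness, the out-of-sync count itself can be read from drbdsetup's statistics output, something like:

    # a non-zero "out-of-sync:" counter on the peer device confirms it
    drbdsetup status resource-5 --statistics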

I have tried searching for "unexpected repl_state (Established) in receive_bitmap" and found an open LINSTOR issue on GitHub: https://github.com/LINBIT/linstor-server/issues/218 - it seems to be a similar situation to ours, but in LINSTOR. It hasn't been active for a while, though, and offers no insight into what the cause might be.

DRBD version info (as reported by drbdadm --version):

DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x09001e
DRBD_KERNEL_VERSION=9.0.30
DRBDADM_VERSION_CODE=0x091500
DRBDADM_VERSION=9.21.0-rc.1


I'm also attaching the full kern.log entries for the resource from each side, in case they're useful:

Side 7c:
Apr 22 23:11:09 host-7c: [  127.801308] drbd resource-5: Starting worker thread (from drbdsetup [20333])
Apr 22 23:11:09 host-7c: [  127.818388] drbd resource-5 host-7d: Starting sender thread (from drbdsetup [20343])
Apr 22 23:11:09 host-7c: [  127.847948] drbd resource-5/0 drbd1005: meta-data IO uses: blk-bio
Apr 22 23:11:09 host-7c: [  127.848912] drbd resource-5/0 drbd1005: disk( Diskless -> Attaching )
Apr 22 23:11:09 host-7c: [  127.848920] drbd resource-5/0 drbd1005: Maximum number of peer devices = 1
Apr 22 23:11:09 host-7c: [  127.849210] drbd resource-5: Method to ensure write ordering: flush
Apr 22 23:11:09 host-7c: [  127.849215] drbd resource-5/0 drbd1005: Adjusting my ra_pages to backing device's (32 -> 1)
Apr 22 23:11:09 host-7c: [  127.849230] drbd resource-5/0 drbd1005: drbd_bm_resize called with capacity == 41943040
Apr 22 23:11:09 host-7c: [  127.849550] drbd resource-5/0 drbd1005: resync bitmap: bits=5242880 words=81920 pages=160
Apr 22 23:11:09 host-7c: [  127.849560] drbd1005: detected capacity change from 0 to 21474836480
Apr 22 23:11:09 host-7c: [  127.849563] drbd resource-5/0 drbd1005: size = 20 GB (20971520 KB)
Apr 22 23:11:09 host-7c: [  127.850493] drbd resource-5/0 drbd1005: recounting of set bits took additional 0ms
Apr 22 23:11:09 host-7c: [  127.850499] drbd resource-5/0 drbd1005: disk( Attaching -> Inconsistent )
Apr 22 23:11:09 host-7c: [  127.850501] drbd resource-5/0 drbd1005: attached to current UUID: 0000000000000004
Apr 22 23:11:09 host-7c: [  127.854514] drbd resource-5 host-7d: conn( StandAlone -> Unconnected )
Apr 22 23:11:09 host-7c: [  127.854920] drbd resource-5 host-7d: Starting receiver thread (from drbd_w_0b6d3edd [20338])
Apr 22 23:11:09 host-7c: [  127.859587] drbd resource-5 host-7d: conn( Unconnected -> Connecting )
Apr 22 23:11:09 host-7c: [  128.387629] drbd resource-5 host-7d: Handshake to peer 0 successful: Agreed network protocol version 120
Apr 22 23:11:09 host-7c: [  128.387631] drbd resource-5 host-7d: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Apr 22 23:11:09 host-7c: [  128.387635] drbd resource-5 host-7d: Starting ack_recv thread (from drbd_r_0b6d3edd [20374])
Apr 22 23:11:09 host-7c: [  128.426356] drbd resource-5 host-7d: Preparing remote state change 1064560840
Apr 22 23:11:09 host-7c: [  128.439641] drbd resource-5/0 drbd1005 host-7d: drbd_sync_handshake:
Apr 22 23:11:09 host-7c: [  128.439643] drbd resource-5/0 drbd1005 host-7d: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:24
Apr 22 23:11:09 host-7c: [  128.439645] drbd resource-5/0 drbd1005 host-7d: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:24
Apr 22 23:11:09 host-7c: [  128.439646] drbd resource-5/0 drbd1005 host-7d: uuid_compare()=no-sync by rule=just-created-both
Apr 22 23:11:09 host-7c: [  128.442606] drbd resource-5 host-7d: Committing remote state change 1064560840 (primary_nodes=0)
Apr 22 23:11:09 host-7c: [  128.442613] drbd resource-5 host-7d: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Apr 22 23:11:09 host-7c: [  128.442615] drbd resource-5/0 drbd1005 host-7d: pdsk( DUnknown -> Inconsistent ) repl( Off -> Established )
Apr 22 23:11:25 host-7c: [  144.041168] drbd resource-5/0 drbd1005 host-7d: resync-susp( no -> peer )
Apr 22 23:11:25 host-7c: [  144.042271] drbd resource-5: Preparing cluster-wide state change 3912736869 (1->-1 7683/4609)
Apr 22 23:11:25 host-7c: [  144.042524] drbd resource-5: State change 3912736869: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
Apr 22 23:11:25 host-7c: [  144.042525] drbd resource-5: Committing cluster-wide state change 3912736869 (0ms)
Apr 22 23:11:25 host-7c: [  144.042530] drbd resource-5: role( Secondary -> Primary )
Apr 22 23:11:25 host-7c: [  144.042531] drbd resource-5/0 drbd1005: disk( Inconsistent -> UpToDate )
Apr 22 23:11:25 host-7c: [  144.042560] drbd resource-5/0 drbd1005: size = 20 GB (20971520 KB)
Apr 22 23:11:25 host-7c: [  144.043722] drbd resource-5: Forced to consider local data as UpToDate!
Apr 22 23:11:25 host-7c: [  144.045194] drbd resource-5/0 drbd1005: new current UUID: E8D4DA04CB03E845 weak: FFFFFFFFFFFFFFFC
Apr 22 23:22:23 host-7c: [  801.305463] drbd resource-5/0 drbd1005 host-7d: drbd_sync_handshake:
Apr 22 23:22:23 host-7c: [  801.305466] drbd resource-5/0 drbd1005 host-7d: self E8D4DA04CB03E845:AC3365A2A7A0372F:0000000000000000:0000000000000000 bits:0 flags:20
Apr 22 23:22:23 host-7c: [  801.305468] drbd resource-5/0 drbd1005 host-7d: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:4
Apr 22 23:22:23 host-7c: [  801.305469] drbd resource-5/0 drbd1005 host-7d: uuid_compare()=source-set-bitmap by rule=just-created-peer
Apr 22 23:22:23 host-7c: [  801.305471] drbd resource-5/0 drbd1005 host-7d: Setting and writing one bitmap slot, after drbd_sync_handshake
Apr 22 23:22:23 host-7c: [  801.309659] drbd resource-5/0 drbd1005 host-7d: drbd_sync_handshake:
Apr 22 23:22:23 host-7c: [  801.309663] drbd resource-5/0 drbd1005 host-7d: self E8D4DA04CB03E845:AC3365A2A7A0372F:0000000000000000:0000000000000000 bits:0 flags:20
Apr 22 23:22:23 host-7c: [  801.309664] drbd resource-5/0 drbd1005 host-7d: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:4
Apr 22 23:22:23 host-7c: [  801.309666] drbd resource-5/0 drbd1005 host-7d: uuid_compare()=source-set-bitmap by rule=just-created-peer
Apr 22 23:22:23 host-7c: [  801.309667] drbd resource-5/0 drbd1005 host-7d: Setting and writing one bitmap slot, after drbd_sync_handshake
Apr 22 23:22:23 host-7c: [  801.310865] drbd resource-5/0 drbd1005 host-7d: repl( Established -> WFBitMapS ) resync-susp( peer -> no )
Apr 22 23:22:23 host-7c: [  801.311814] drbd resource-5/0 drbd1005 host-7d: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%

Side 7d:
Apr 22 23:08:44 host-7d: [  136.465878] drbd resource-5: Starting worker thread (from drbdsetup [22498])
Apr 22 23:08:45 host-7d: [  136.482132] drbd resource-5 host-7c: Starting sender thread (from drbdsetup [22507])
Apr 22 23:08:45 host-7d: [  136.508338] drbd resource-5/0 drbd1005: meta-data IO uses: blk-bio
Apr 22 23:08:45 host-7d: [  136.508639] drbd resource-5/0 drbd1005: disk( Diskless -> Attaching )
Apr 22 23:08:45 host-7d: [  136.508645] drbd resource-5/0 drbd1005: Maximum number of peer devices = 1
Apr 22 23:08:45 host-7d: [  136.509135] drbd resource-5: Method to ensure write ordering: flush
Apr 22 23:08:45 host-7d: [  136.509138] drbd resource-5/0 drbd1005: Adjusting my ra_pages to backing device's (32 -> 1)
Apr 22 23:08:45 host-7d: [  136.509147] drbd resource-5/0 drbd1005: drbd_bm_resize called with capacity == 41943040
Apr 22 23:08:45 host-7d: [  136.509507] drbd resource-5/0 drbd1005: resync bitmap: bits=5242880 words=81920 pages=160
Apr 22 23:08:45 host-7d: [  136.509541] drbd1005: detected capacity change from 0 to 21474836480
Apr 22 23:08:45 host-7d: [  136.509544] drbd resource-5/0 drbd1005: size = 20 GB (20971520 KB)
Apr 22 23:08:45 host-7d: [  136.511572] drbd resource-5/0 drbd1005: recounting of set bits took additional 0ms
Apr 22 23:08:45 host-7d: [  136.511578] drbd resource-5/0 drbd1005: disk( Attaching -> Inconsistent )
Apr 22 23:08:45 host-7d: [  136.511580] drbd resource-5/0 drbd1005: attached to current UUID: 0000000000000004
Apr 22 23:08:45 host-7d: [  136.515806] drbd resource-5 host-7c: conn( StandAlone -> Unconnected )
Apr 22 23:08:45 host-7d: [  136.516244] drbd resource-5 host-7c: Starting receiver thread (from drbd_w_0b6d3edd [22501])
Apr 22 23:08:45 host-7d: [  136.520296] drbd resource-5 host-7c: conn( Unconnected -> Connecting )
Apr 22 23:11:10 host-7d: [  282.016563] drbd resource-5 host-7c: Handshake to peer 1 successful: Agreed network protocol version 120
Apr 22 23:11:10 host-7d: [  282.016568] drbd resource-5 host-7c: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Apr 22 23:11:10 host-7d: [  282.016595] drbd resource-5 host-7c: Starting ack_recv thread (from drbd_r_0b6d3edd [22530])
Apr 22 23:11:10 host-7d: [  282.054881] drbd resource-5: Preparing cluster-wide state change 1064560840 (0->1 499/146)
Apr 22 23:11:10 host-7d: [  282.070806] drbd resource-5/0 drbd1005 host-7c: drbd_sync_handshake:
Apr 22 23:11:10 host-7d: [  282.070809] drbd resource-5/0 drbd1005 host-7c: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:24
Apr 22 23:11:10 host-7d: [  282.070811] drbd resource-5/0 drbd1005 host-7c: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:24
Apr 22 23:11:10 host-7d: [  282.070812] drbd resource-5/0 drbd1005 host-7c: uuid_compare()=no-sync by rule=just-created-both
Apr 22 23:11:10 host-7d: [  282.071237] drbd resource-5: State change 1064560840: primary_nodes=0, weak_nodes=0
Apr 22 23:11:10 host-7d: [  282.071240] drbd resource-5: Committing cluster-wide state change 1064560840 (16ms)
Apr 22 23:11:10 host-7d: [  282.071279] drbd resource-5 host-7c: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Apr 22 23:11:10 host-7d: [  282.071285] drbd resource-5/0 drbd1005 host-7c: pdsk( DUnknown -> Inconsistent ) repl( Off -> Established )
Apr 22 23:11:26 host-7d: [  297.669646] drbd resource-5/0 drbd1005 host-7c: resync-susp( no -> after dependency )
Apr 22 23:11:26 host-7d: [  297.671299] drbd resource-5 host-7c: Preparing remote state change 3912736869
Apr 22 23:11:26 host-7d: [  297.672524] drbd resource-5/0 drbd1005 host-7c: pdsk( Inconsistent -> UpToDate )
Apr 22 23:11:26 host-7d: [  297.672555] drbd resource-5/0 drbd1005: size = 20 GB (20971520 KB)
Apr 22 23:11:26 host-7d: [  297.673190] drbd resource-5 host-7c: Committing remote state change 3912736869 (primary_nodes=2)
Apr 22 23:11:26 host-7d: [  297.673195] drbd resource-5 host-7c: peer( Secondary -> Primary )
Apr 22 23:22:23 host-7d: [  954.942328] drbd resource-5/0 drbd1005 host-7c: resync-susp( peer -> no )
Apr 22 23:22:23 host-7d: [  954.947226] drbd resource-5/0 drbd1005 host-7c: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Apr 22 23:22:23 host-7d: [  954.947227] drbd resource-5/0 drbd1005 host-7c: unexpected repl_state (Established) in receive_bitmap


Any help is much appreciated.

Regards,
Alexander

