[DRBD-user] 8 Zettabytes out-of-sync?

Fri Nov 2 11:33:33 CET 2018

On 02/11/18 08:45, Jarno Elonen wrote:
> More clues:
> 
> Just witnessed a resync (after invalidate) steadily go from 100%
> out-of-sync to 0% (after several automatic disconnects and reconnects).
> Immediately after reaching 0%, it jumped to -<very-large-number>%!
> After that, drbdtop started showing 8.0ZiB out-of-sync.
> 
> Looks like a severe wrap-around bug.
> 
> -Jarno
> 
> 
> On Thu, 1 Nov 2018 at 22:30, Jarno Elonen <elonen at iki.fi> wrote:
> 
>     Here's some more info.
>     Dmesg shows some suspicious-looking log messages, such as:
> 
>     1) FIXME drbd_s_vm-117-s[2830] op clear, bitmap locked for 'receive
>     bitmap' by drbd_r_vm-117-s[5038]
> 
>     2) Wrong magic value 0xffff0007 in protocol version 114
> 
>     3) peer request with dagtag 399201392 not found
>     got_peer_ack [drbd] failed
> 
>     4) Rejecting concurrent remote state change 2226202936 because of
>     state change 2923939731
>     Ignoring P_TWOPC_ABORT packet 2226202936.
> 
>     5) drbd_r_vm-117-s[5038] going to 'detect_finished_resyncs()' but
>     bitmap already locked for 'write from resync_finished' by
>     drbd_w_vm-117-s[2812]
>     md_sync_timer expired! Worker calls drbd_md_sync().
> 
>     6) incompatible discard-my-data settings
>     conn( Connecting -> Disconnecting )
>     error receiving P_PROTOCOL, e: -5 l: 7!
> 
>     Two of the four nodes have DRBD 9.0.15-1 and two have 9.0.16-1. All
>     of them API v 16:
> 
>     == mox-a ==
>     version: 9.0.15-1 (api:2/proto:86-114)
>     GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
>     root at mox-a, 2018-10-28 03:08:58
>     Transports (api:16): tcp (9.0.15-1)
> 
>     == mox-b ==
>     version: 9.0.15-1 (api:2/proto:86-114)
>     GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
>     root at mox-b, 2018-10-10 17:50:25
>     Transports (api:16): tcp (9.0.15-1)
> 
>     == mox-c ==
>     version: 9.0.16-1 (api:2/proto:86-114)
>     GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
>     root at mox-c, 2018-10-28 05:45:05
>     Transports (api:16): tcp (9.0.16-1)
> 
>     == mox-d ==
>     version: 9.0.16-1 (api:2/proto:86-114)
>     GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
>     root at mox-d, 2018-10-29 00:22:23
>     Transports (api:16): tcp (9.0.16-1)
> 
>     Running Proxmox (5.2-2), as you'd guess from the host names. DRBD
>     resources are managed by LINSTOR.
> 
> 
>     On Thu, 1 Nov 2018 at 17:32, Jarno Elonen <elonen at iki.fi> wrote:
> 
>         Okay, today one of these resources got a sudden, severe
>         filesystem corruption on the primary.
> 
>         On the other hand, the secondaries (that showed 8ZiB
>         out-of-sync) were still mountable after I disconnected the
>         corrupted primary. No idea how current the data on the
>         secondaries was, but drbdtop still showed them as connected
>         and 8ZiB out-of-sync.
> 
>         This is getting quite worrisome. Is anyone else experiencing
>         this with DRBD 9? Is it something really wrong in my setup, or
>         are there perhaps some known instabilities in DRBD 9.0.15-1?
> 
>         -Jarno
> 
> 
>         On Wed, 31 Oct 2018 at 20:46, Jarno Elonen <elonen at iki.fi> wrote:
> 
>             I've got several DRBD 9 resources that constantly show
>             *UpToDate* with 9223372036854774304 KiB (almost exactly 8ZiB)
>             of out-of-sync data.
> 
>             Any idea what might cause this and how to fix it?
> 
>             Example:
> 
>             # drbdsetup status --verbose --statistics vm-106-disk-1
>             vm-106-disk-1 node-id:0 role:Primary suspended:no
>                  write-ordering:flush
>                volume:0 minor:1003 disk:UpToDate quorum:yes
>                    size:16777688 read:215779 written:22369564 al-writes:89 bm-writes:0 upper-pending:0
>                    lower-pending:0 al-suspended:no blocked:no
>                mox-a node-id:1 connection:Connected role:Secondary congested:no ap-in-flight:0
>                    rs-in-flight:18446744073709549808
>                  volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
>                      received:215116 sent:22368903 out-of-sync:9223372036854774304 pending:0 unacked:0
>                mox-c node-id:2 connection:Connected role:Secondary congested:no ap-in-flight:0
>                    rs-in-flight:18446744073709549808
>                  volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
>                      received:1188 sent:19884428 out-of-sync:0 pending:0 unacked:0
> 
>             Version info:
>             version: 9.0.15-1 (api:2/proto:86-114)
>             GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
>             root at mox-b, 2018-10-10 17:50:25
>             Transports (api:16): tcp (9.0.15-1)
> 
>             -Jarno
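
A note on the numbers in that status output: both odd values sit just
below a power of two, which is what a small negative counter looks like
when it is printed as an unsigned 64-bit integer. The sketch below is
plain Python, not DRBD code; it only reinterprets the two values copied
from the output above. The helper name as_signed64 is made up for
illustration, and the assumptions that out-of-sync is reported in KiB
and is a halved count of 512-byte sectors are mine, not something
confirmed in this thread.

```python
# Values copied verbatim from the quoted `drbdsetup status` output.
RS_IN_FLIGHT = 18446744073709549808
OUT_OF_SYNC = 9223372036854774304


def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as two's-complement signed.

    Hypothetical helper for illustration; not part of any DRBD tool.
    """
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


# rs-in-flight is just below 2**64, i.e. a small negative number.
rs_signed = as_signed64(RS_IN_FLIGHT)

# out-of-sync is just below 2**63. ASSUMPTION: it is an unsigned
# sector count divided by two (512-byte sectors -> KiB), so doubling
# it recovers the wrapped counter.
oos_sectors_signed = as_signed64(OUT_OF_SYNC * 2)

# ASSUMPTION: the field is in KiB; convert to ZiB (2**70 bytes).
oos_zib = OUT_OF_SYNC * 1024 / 2**70

print("rs-in-flight as signed 64-bit:", rs_signed)
print("out-of-sync * 2 as signed 64-bit:", oos_sectors_signed)
print("out-of-sync distance below 2**63:", (1 << 63) - OUT_OF_SYNC)
print("out-of-sync in ZiB (if KiB): %.1f" % oos_zib)
```

If those assumptions hold, rs-in-flight is really -1808 and out-of-sync
is a counter of -3008 that was halved after wrapping, which would fit
the wrap-around guess above and the 8.0ZiB that drbdtop displays.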

Not exactly the same issue you are seeing, but I have had an issue this 
week with a newly created resource on a 9.0.16-1 primary against a 
9.0.13-1 secondary.

As soon as I started writing to the new primary the secondary started 
repeatedly disconnecting with the error:

drbd resource274 primary.host: Unexpected data packet ? (0x0036)

followed by a resync (and then the same error again, followed by a resync, ...)

Probably completely unrelated to your issues, and I know there are a
_lot_ of bug fixes between 9.0.13-1 and 9.0.16-1 (and I _do_ have a
long-overdue update of the secondary planned very soon).

Theoretically, different 9.0.x kernel versions should be able to work
together (same api). But in practice, I avoid it and usually update drbd
& kernel at the same time on all nodes.

So it could be that 9.0.16-1 has particular problems co-operating
with earlier versions, perhaps more so than other versions.
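
A quick way to spot a mixed-version cluster is to compare the version
lines from every node. Here is a minimal Python sketch that uses the
four version lines pasted earlier in this thread as input; the function
names and the idea of collecting the lines into a dict by hand are only
illustrative, not part of any DRBD or LINSTOR tooling.

```python
import re

# Version lines as pasted earlier in this thread, keyed by host name.
VERSION_LINES = {
    "mox-a": "version: 9.0.15-1 (api:2/proto:86-114)",
    "mox-b": "version: 9.0.15-1 (api:2/proto:86-114)",
    "mox-c": "version: 9.0.16-1 (api:2/proto:86-114)",
    "mox-d": "version: 9.0.16-1 (api:2/proto:86-114)",
}

VERSION_RE = re.compile(r"version:\s*(\S+)\s*\(api:(\d+)/proto:(\d+)-(\d+)\)")


def parse_version_line(line):
    """Return (version, api, (proto_min, proto_max)) from one line."""
    match = VERSION_RE.search(line)
    if match is None:
        raise ValueError("unrecognised version line: %r" % line)
    version, api, proto_min, proto_max = match.groups()
    return version, int(api), (int(proto_min), int(proto_max))


def group_hosts_by_version(lines):
    """Map each module version to the sorted list of hosts running it."""
    groups = {}
    for host in sorted(lines):
        version = parse_version_line(lines[host])[0]
        groups.setdefault(version, []).append(host)
    return groups


groups = group_hosts_by_version(VERSION_LINES)
mixed = len(groups) > 1

for version, hosts in sorted(groups.items()):
    print(version, "->", ", ".join(hosts))
if mixed:
    print("Mixed module versions; consider aligning them on all nodes.")
```

All four nodes report the same api and proto range here, so a check on
those fields alone would not flag anything; only the module version
differs.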

Eddie