[DRBD-user] 8 Zettabytes out-of-sync?
Eddie Chapman
eddie at ehuk.net
Fri Nov 2 11:33:33 CET 2018
On 02/11/18 08:45, Jarno Elonen wrote:
> More clues:
>
> Just witnessed a resync (after invalidate) steadily go from 100%
> out-of-sync to 0% (after several automatic disconnects and reconnects).
> Immediately after reaching 0%, it went to -<very-large-number>%!
> After that, drbdtop started showing 8.0ZiB out-of-sync.
>
> Looks like a severe wrap-around bug.
>
> -Jarno
>
>
> On Thu, 1 Nov 2018 at 22:30, Jarno Elonen <elonen at iki.fi
> <mailto:elonen at iki.fi>> wrote:
>
> Here's some more info.
> Dmesg shows some suspicious-looking log messages, such as:
>
> 1) FIXME drbd_s_vm-117-s[2830] op clear, bitmap locked for 'receive
> bitmap' by drbd_r_vm-117-s[5038]
>
> 2) Wrong magic value 0xffff0007 in protocol version 114
>
> 3) peer request with dagtag 399201392 not found
> got_peer_ack [drbd] failed
>
> 4) Rejecting concurrent remote state change 2226202936 because of
> state change 2923939731
> Ignoring P_TWOPC_ABORT packet 2226202936.
>
> 5) drbd_r_vm-117-s[5038] going to 'detect_finished_resyncs()' but
> bitmap already locked for 'write from resync_finished' by
> drbd_w_vm-117-s[2812]
> md_sync_timer expired! Worker calls drbd_md_sync().
>
> 6) incompatible discard-my-data settings
> conn( Connecting -> Disconnecting )
> error receiving P_PROTOCOL, e: -5 l: 7!
>
> Two of the four nodes have DRBD 9.0.15-1 and two have 9.0.16-1. All
> of them report transport API version 16:
>
> == mox-a ==
> version: 9.0.15-1 (api:2/proto:86-114)
> GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
> root at mox-a, 2018-10-28 03:08:58
> Transports (api:16): tcp (9.0.15-1)
>
> == mox-b ==
> version: 9.0.15-1 (api:2/proto:86-114)
> GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
> root at mox-b, 2018-10-10 17:50:25
> Transports (api:16): tcp (9.0.15-1)
>
> == mox-c ==
> version: 9.0.16-1 (api:2/proto:86-114)
> GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
> root at mox-c, 2018-10-28 05:45:05
> Transports (api:16): tcp (9.0.16-1)
>
> == mox-d ==
> version: 9.0.16-1 (api:2/proto:86-114)
> GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
> root at mox-d, 2018-10-29 00:22:23
> Transports (api:16): tcp (9.0.16-1)
>
> Running Proxmox (5.2-2), as you'd guess from the host names. DRBD
> resources are managed by LINSTOR.
>
>
> On Thu, 1 Nov 2018 at 17:32, Jarno Elonen <elonen at iki.fi
> <mailto:elonen at iki.fi>> wrote:
>
> Okay, today one of these resources got a sudden, severe
> filesystem corruption on the primary.
>
> On the other hand, the secondaries (that showed 8ZiB
> out-of-sync) were still mountable after I disconnected the
> corrupted primary. No idea how current the data on the secondaries
> was, but drbdtop still showed them as connected and 8ZiB out-of-sync.
>
> This is getting quite worrisome. Is anyone else experiencing
> this with DRBD 9? Is it something really wrong in my setup, or
> are there perhaps some known instabilities in DRBD 9.0.15-1?
>
> -Jarno
>
>
> On Wed, 31 Oct 2018 at 20:46, Jarno Elonen <elonen at iki.fi
> <mailto:elonen at iki.fi>> wrote:
>
> I've got several DRBD 9 resources that constantly show
> *UpToDate* with 9223372036854774304 KiB (almost exactly 8ZiB) of
> out-of-sync data.
>
> Any idea what might cause this and how to fix it?
>
> Example:
>
> # drbdsetup status --verbose --statistics vm-106-disk-1
> vm-106-disk-1 node-id:0 role:Primary suspended:no
> write-ordering:flush
> volume:0 minor:1003 disk:UpToDate quorum:yes
> size:16777688 read:215779 written:22369564
> al-writes:89 bm-writes:0 upper-pending:0
> lower-pending:0 al-suspended:no blocked:no
> mox-a node-id:1 connection:Connected role:Secondary
> congested:no ap-in-flight:0
> rs-in-flight:18446744073709549808
> volume:0 replication:Established peer-disk:UpToDate
> resync-suspended:no
> received:215116 sent:22368903
> out-of-sync:9223372036854774304 pending:0 unacked:0
> mox-c node-id:2 connection:Connected role:Secondary
> congested:no ap-in-flight:0
> rs-in-flight:18446744073709549808
> volume:0 replication:Established peer-disk:UpToDate
> resync-suspended:no
> received:1188 sent:19884428 out-of-sync:0 pending:0
> unacked:0
>
> Version info:
> version: 9.0.15-1 (api:2/proto:86-114)
> GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
> root at mox-b, 2018-10-10 17:50:25
> Transports (api:16): tcp (9.0.15-1)
>
> -Jarno
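A quick sanity check on the two large numbers in the status output above supports the wrap-around theory. This is only a sketch of the arithmetic, not a statement about where in DRBD the bug lives. It assumes the counters are 64-bit, and the last step (a negative sector count halved to get KiB) is a purely hypothetical mechanism picked because it happens to reproduce the out-of-sync value.

```python
# Reinterpret the counters from the quoted status output as 64-bit values.
MASK64 = (1 << 64) - 1


def as_signed64(value: int) -> int:
    """Interpret an unsigned 64-bit integer as two's-complement signed."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


rs_in_flight = 18446744073709549808  # rs-in-flight from the status output
out_of_sync = 9223372036854774304    # out-of-sync from the status output

# rs-in-flight is a small negative number stored as unsigned 64-bit.
print(as_signed64(rs_in_flight))     # -1808

# out-of-sync sits just below 2**63.
print((1 << 63) - out_of_sync)       # 1504

# Hypothetical mechanism (an assumption, not taken from the DRBD source):
# a counter of -3008 sectors, stored unsigned and shifted right by one
# to convert 512-byte sectors to KiB, yields exactly the reported value.
wrapped = (-3008 & MASK64) >> 1
print(wrapped == out_of_sync)        # True
```

If out-of-sync is reported in KiB, 2**63 KiB is 2**73 bytes, which is where the 8.0ZiB shown by drbdtop would come from.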
Not exactly the same issue you are seeing, but I have had an issue this
week with a newly created resource on a 9.0.16-1 primary against a
9.0.13-1 secondary.
As soon as I started writing to the new primary the secondary started
repeatedly disconnecting with the error:
drbd resource274 primary.host: Unexpected data packet ? (0x0036)
followed by a resync (and then the same error again, followed by another resync, ...)
Probably completely unrelated to your issues, and I know there are a
_lot_ of bug fixes between 9.0.13-1 and 9.0.16-1 (and I _do_ have a
long-overdue update of the secondary planned very soon).
Theoretically, different 9.0.x kernel versions should be able to work
together (same API). But in practice, I avoid it and usually update drbd
& kernel at the same time on all nodes.
So it could be that 9.0.16-1 has particular problems co-operating
with earlier versions, perhaps more so than other versions.
Eddie
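
For anyone wanting to check a cluster for mixed versions, here is a minimal sketch that groups nodes by the DRBD version they report. It only parses "version:" lines in the form quoted in this thread; how those lines are collected from each node is left out, and the helper names (parse_version_line, group_by_version) are made up for this example.

```python
import re

# "version:" lines exactly as quoted in this thread, keyed by host name.
reported = {
    "mox-a": "version: 9.0.15-1 (api:2/proto:86-114)",
    "mox-b": "version: 9.0.15-1 (api:2/proto:86-114)",
    "mox-c": "version: 9.0.16-1 (api:2/proto:86-114)",
    "mox-d": "version: 9.0.16-1 (api:2/proto:86-114)",
}

VERSION_RE = re.compile(r"version:\s*(\S+)\s+\(api:(\d+)/proto:(\d+)-(\d+)\)")


def parse_version_line(line):
    """Return (version, api, proto_min, proto_max) from a 'version:' line."""
    match = VERSION_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised version line: {line!r}")
    version, api, proto_min, proto_max = match.groups()
    return version, int(api), int(proto_min), int(proto_max)


def group_by_version(lines):
    """Map each reported DRBD version to the list of hosts running it."""
    groups = {}
    for host, line in lines.items():
        version = parse_version_line(line)[0]
        groups.setdefault(version, []).append(host)
    return groups


groups = group_by_version(reported)
mixed = len(groups) > 1  # more than one version in the cluster

for version, hosts in sorted(groups.items()):
    print(version, ", ".join(hosts))
if mixed:
    print("warning: nodes are running different DRBD versions")
```

With the four hosts from this thread it reports two groups (9.0.15-1 and 9.0.16-1) and prints the warning, even though all four advertise the same api/proto range.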