[DRBD-user] Semantics and possible sources of out-of-sync blocks with DRBD8

Christoph Lechleitner christoph.lechleitner at iteg.at
Wed Jun 27 16:06:02 CEST 2018

Hello everybody!

We run 4 different (but similar) pairs of servers with 5 to 50 DRBD 8 resources each (one resource per LXC guest, backed by LVM2 volumes, i.e. DRBD on LVM) with resource sizes from 2 GB to 2 TB.

3 pairs of servers run Debian's 4.9 kernel, one pair runs 4.14.
The DRBD module versions are 8.4.7 (probably with bits backported) and 8.4.10, respectively.

For half a year now we have been running a regular
  drbdadm verify
on all DRBD resources, one after the other.

Unfortunately, about half of the resources show between 4 and tens of thousands of out-of-sync blocks (i.e. non-zero numbers after oos: in /proc/drbd) after most verify runs.
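For reference, a minimal sketch of how the oos: values can be collected per device. It assumes the usual DRBD 8.4 /proc/drbd layout; the sample text and its numbers are made up for illustration, and the function names are hypothetical.

```python
import re

# Illustrative sample in the DRBD 8.4 /proc/drbd layout; the values are made up.
SAMPLE = """\
version: 8.4.7 (api:1/proto:86-101)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:128 dw:128 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4096
"""

# A device section starts with "<minor>: cs:..."; the counters follow it.
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:")
OOS_RE = re.compile(r"\boos:(\d+)")


def parse_oos(text):
    """Return {minor: oos_value} for every device found in the text."""
    result = {}
    minor = None
    for line in text.splitlines():
        device = DEVICE_RE.match(line)
        if device:
            minor = int(device.group(1))
        oos = OOS_RE.search(line)
        if oos and minor is not None:
            result[minor] = int(oos.group(1))
    return result


def devices_out_of_sync(text):
    """Keep only the devices whose oos: value is non-zero."""
    return {minor: oos for minor, oos in parse_oos(text).items() if oos > 0}
```

On a real node the input would be the content of /proc/drbd, e.g. devices_out_of_sync(open("/proc/drbd").read()).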

This may have been asked before, but does a non-zero number after oos: in /proc/drbd *really* mean that there is definitely inconsistent data that the sync process (or the blocks' consistency flags) would not know about without the verification run?

Or could the normal sync process fool the verification by writing a block on the primary between the reads by the verification process?

As in:

1. verification reads block on machine A

2. block device driver writes block on primary and flags it as changed.
This knowingly creates an inconsistent state (and there's nothing wrong with that).

3. verification reads block on machine B

4. block device driver or sync writes block on secondary, and flags it as clean on the primary.
This resolves the planned temporary inconsistency, but has fooled the verification process into a false positive.
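The four steps above can be replayed in a toy model. This is only a sketch of the hypothetical interleaving, with a single block and made-up contents; it says nothing about whether DRBD actually permits this ordering. Which machine is the primary is not fixed by the steps, so the hypothetical function below takes it as a parameter.

```python
def simulate_verify_race(primary):
    """Replay steps 1-4 for one block; `primary` is "A" or "B"."""
    nodes = {"A": "old", "B": "old"}   # block content per machine, in sync
    secondary = "B" if primary == "A" else "A"

    seen_on_a = nodes["A"]             # 1. verification reads block on A
    nodes[primary] = "new"             # 2. write on primary, block flagged changed
    dirty = True
    seen_on_b = nodes["B"]             # 3. verification reads block on B
    nodes[secondary] = nodes[primary]  # 4. replication writes block on secondary
    dirty = False                      #    and the block is flagged clean again

    return {
        "verify_mismatch": seen_on_a != seen_on_b,
        "in_sync_afterwards": nodes["A"] == nodes["B"] and not dirty,
    }


if __name__ == "__main__":
    for primary in ("A", "B"):
        print(primary, simulate_verify_race(primary))
```

In this model the verification only sees a mismatch when machine B is the primary: it then compares the old content from A with the new content from B, although both machines hold the same data after step 4. With A as the primary, both reads return the old content and no mismatch is seen.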

So, if the oos: blocks are real data errors, we are out of ideas, except for hoping that an update to kernel 4.16 or DRBD module 8.4.10 might solve the issues.
(We are going to upgrade some nodes to 4.16, which requires activating AppArmor, btw.)

We have thoroughly searched (filtered kernel calls) for O_DIRECT operations, and have found nothing except one Oracle process (one test machine only, don't care) and LVM's regular reads (supposedly outside the DRBD resources).
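To illustrate the kind of filter meant here, a minimal sketch that assumes the system calls were captured as strace-style text lines (an assumption; the sample lines and paths below are made up, and the function name is hypothetical). One pitfall is that a plain substring search for O_DIRECT also hits O_DIRECTORY, so the pattern uses word boundaries.

```python
import re

# Hypothetical strace-style lines; the paths and descriptors are made up.
SAMPLE_LINES = [
    'openat(AT_FDCWD, "/var/lib/example/data.db", O_RDWR|O_DIRECT) = 7',
    'openat(AT_FDCWD, "/etc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3',
    'open("/tmp/example.log", O_WRONLY|O_CREAT|O_APPEND, 0644) = 4',
]

OPEN_RE = re.compile(r"\b(?:open|openat)\(")
# Word boundaries keep O_DIRECTORY from matching as O_DIRECT.
DIRECT_RE = re.compile(r"\bO_DIRECT\b")


def direct_io_opens(lines):
    """Return the open()/openat() lines whose flags include O_DIRECT."""
    return [
        line for line in lines
        if OPEN_RE.search(line) and DIRECT_RE.search(line)
    ]


if __name__ == "__main__":
    for hit in direct_io_opens(SAMPLE_LINES):
        print(hit)
```

This only covers flags passed at open time; O_DIRECT can also be switched on later via fcntl(F_SETFL), which such a filter would miss.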

Is there any known source of out-of-sync blocks apart from O_DIRECT, faulty hardware, and (unknown) bugs?

Regards, Christoph


Christoph Lechleitner


ITEG IT-Engineers GmbH | Conradstr. 5, A-6020 Innsbruck
FN 365826f | Handelsgericht Innsbruck | Mobiltelefon: +43 676 3674710
Mail: christoph.lechleitner at iteg.at | Web: http://www.iteg.at/
