[DRBD-user] Dual primary and LVM

Lars Ellenberg lars.ellenberg at linbit.com
Mon Aug 7 12:59:51 CEST 2017

On Thu, Jul 27, 2017 at 10:11:48AM +0200, Gionatan Danti wrote:
> To clarify: the main reason I am asking about the feasibility of a
> dual-primary DRBD setup with LVs on top of it is cache coherency. Let
> me take a step back: the explanation given for denying even read access on a
> secondary node is broken cache coherency/consistency: if the read/write
> node writes something the secondary node had previously read, the latter
> will not recognize the changes made by the first node. The canonical
> solution to this problem is to use a dual-primary setup with a clustered
> filesystem (e.g. GFS2), which not only arbitrates write access but also
> maintains read cache consistency.
> Now, let's remove the clustered filesystem layer, leaving "naked" LVs only.
> How is read cache coherency maintained in this case? As no filesystem is
> layered on top of the raw LVs, there is no real pagecache at work, but the
> kernel's buffers remain, and they need to be kept coherent. How does DRBD
> achieve this? Does it update the receiving kernel's I/O buffers each time the
> other node writes something?

DRBD does not at all interact with the layers above it,
so it does not know, and does not care, which entities may or may not
have cached data they read earlier.
Any entities that need cache coherency across multiple instances
need to coordinate in some way.

But that is not DRBD specific at all,
and not even specific to clustering or multi-node setups.

This means that if you intend to use something that is NOT cluster aware
(or multi-instance aware) itself, you may need to add your own band-aid
locking and flushing "somewhere".

I remember that "in the old days", kernel buffer pages could linger for
quite some time, even if the corresponding device was no longer open,
which caused problems with migrating VMs even with something like a shared
SCSI device.  Integration scripts added explicit calls to sync and
blockdev --flushbufs and the like...

The kernel then learned to invalidate cache pages on last close,
so these hacks are no longer necessary (as long as no-one keeps
the device open when not actively used).
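For illustration only, such a hand-over hook could be sketched roughly like
this. The function name flush_block_cache and the DRY_RUN switch are made up
for this example and are not taken from any real integration script; sync and
blockdev --flushbufs are the two calls mentioned above. It assumes a device
path without whitespace.

```shell
#!/bin/sh
# Hypothetical hand-over hook: run on the node that is about to stop
# using the device, before another node starts reading from it.
# Set DRY_RUN=1 to only print the commands instead of running them.
flush_block_cache() {
    dev=$1
    # sync writes dirty data out; blockdev --flushbufs asks the kernel
    # to flush the buffer cache of that one block device.
    for cmd in "sync" "blockdev --flushbufs $dev"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "would run: $cmd"
        else
            $cmd || return 1
        fi
    done
}

# Harmless demonstration; drop DRY_RUN=1 (and run as root) to do it for real.
DRY_RUN=1 flush_block_cache /dev/drbd0
```

On a current kernel this should only matter when something still holds the
device open while the other side is writing.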

The other alternative is to always use "direct IO".

You can (destructively!) experiment with dual primary DRBD:
make both nodes primary, then

on node A,
watch -n1 "dd if=/dev/drbd0 bs=4096 count=1 | strings"
watch -n1 "dd if=/dev/drbd0 bs=4096 count=1 iflag=direct | strings"

on node B,
while sleep 0.1; do date +%F_%T.%N | dd of=/dev/drbd0 bs=4096 conv=sync oflag=direct; done

conv=sync pads with NUL to full bs,
oflag=direct makes sure it finds its way to DRBD
and not just into buffer cache pages

You should see both "watch" thingies show the date changes written on
the other node.

If you then do on "node A": sleep 10 < /dev/drbd0,
the non-direct watch should show the same date for ten seconds,
because it gets its data from buffer cache, and the device is kept open
by the sleep.

Once the "open count" of the device drops down to zero again, the kernel
will invalidate the pages, and the next read will need to re-read from
disk (just as the "direct" read always does).

You can then do
"sleep 10 </dev/drbd0 & sleep 5 ; blockdev --flushbufs /dev/drbd0; wait",
and see the non-direct watch update the date just once after 5 seconds,
and then again once the sleep 10 has finished...

Again, this does not really have anything to do with DRBD,
but with how the kernel treats block devices,
and if and how entities coordinate alternating and concurrent access
to "things". 

You can easily have two entities on the same node corrupt a boring plain
text file on a classic file system on just a single node, if they both
assume "exclusive access", and don't coordinate properly.
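
As a small single-node sketch of what "coordinate properly" can mean: two
writers doing read-modify-write cycles on the same plain text file, serialized
with an advisory lock via flock(1) from util-linux. The bump function and the
temporary COUNTER and LOCK paths are invented for this example. Without the
flock line, increments can get lost.

```shell
#!/bin/sh
# Two writers updating one plain text file on a single node.
# COUNTER and LOCK are throw-away paths created just for this sketch.
COUNTER=$(mktemp)
LOCK="$COUNTER.lock"
echo 0 > "$COUNTER"

# Increment the number stored in $COUNTER $1 times, holding an
# exclusive advisory lock around each read-modify-write cycle.
bump() {
    i=0
    while [ "$i" -lt "$1" ]; do
        (
            flock -x 9              # block until we own the lock on fd 9
            n=$(cat "$COUNTER")
            echo $((n + 1)) > "$COUNTER"
        ) 9>"$LOCK"                 # lock is released when fd 9 is closed
        i=$((i + 1))
    done
}

bump 20 &   # first "entity"
bump 20 &   # second "entity"
wait

cat "$COUNTER"   # with the lock in place: 40
```

The lock is purely advisory: it only helps if every writer takes it, which is
exactly the kind of coordination the layers above a block device have to
arrange for themselves.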

: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
please don't Cc me, but send to list -- I'm subscribed
