[DRBD-user] protocol C replication - unexpected behaviour

Thu Aug 26 15:23:33 CEST 2021

On Thu, Aug 05, 2021 at 11:53:44PM +0200, Janusz Jaskiewicz wrote:
> Hello.
> 
> I'm experimenting a bit with DRBD in a cluster managed by Pacemaker.
> It's a two node, active-passive cluster and the service that I'm
> trying to put in the cluster writes to the file system.
> The service manages many files, it appends to some of them and
> increments the single integer value in others.
> 
> I'm observing surprising behaviour and I would like to ask you
> if what I see is expected or not (I think not).
> 
> I'm using protocol C, but still I see some delay in the files
> that are being replicated to the secondary server.
> For the files that increment the integer I see a difference
> which corresponds roughly to 1 second of traffic.
> 
> I'm really surprised to see this, as protocol C should guarantee
> synchronous replication.
> I'd rather expect some delay in processing (potentially slower
> disk writes due to the network replication).
> 
> The way I'm testing it:
> The service runs on primary and writes to DRBD drive, secondary
> connected and "UpToDate".
> I kill the service abruptly (kill -9) and then take down the
> network interface between primary and secondary (kill and ifdown
> commands in the script so executed quite promptly one after the
> other).
> Then I mount the DRBD drive on both nodes and check the
> difference in the files with incrementing integer.
> 
> I would appreciate any help or pointers on how to fix this.
> But first of all I would like to confirm that this behaviour is
> not expected.
> 
> Also if it is expected/allowed, how can I decrease the impact?

In short: use fsync(). Or use something that does (databases).

A failover is very similar to a hard reset (crash and reboot).
If your "service" or "application" is not crash safe,
DRBD cannot magically make it "failover safe".

"protocol C" (synchronous replication) is only "failover safe"
if the service using it is itself "crash safe" to begin with.

My (educated) guess is that your service is not "crash safe".

I suspect that if you use a single node, no DRBD, no pacemaker,
and instead of your "kill -9; ifdown"
you do "echo b > /proc/sysrq-trigger",
you will find very similar "unexpected" behavior.

If you need your "incremented integer" to be on stable storage,
you need to fsync() after the change.
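
A minimal sketch of what "fsync() after the change" could look
like in C. The helper name store_counter and the fixed-width
record layout are my assumptions, not anything from your service.
The file is assumed to live on the file system on top of the DRBD
device; with protocol C, a successful fsync() should then mean the
peer has the data as well.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: store `value` at the start of `path` and return
 * only after fsync() has succeeded. Returns 0 on success, -1 on error. */
int store_counter(const char *path, long value)
{
    char buf[32];
    /* Fixed-width record, so a shorter number fully overwrites a longer
     * one without having to truncate the file first. */
    int len = snprintf(buf, sizeof buf, "%20ld\n", value);
    if (len < 0 || (size_t)len >= sizeof buf)
        return -1;

    /* No O_TRUNC: the old value stays in place until it is overwritten. */
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* write() returns once the data is in the page cache, not on disk. */
    if (write(fd, buf, (size_t)len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* fsync() returns only after the kernel has pushed the data to the
     * underlying block device and that device has acknowledged it. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Only report the increment as "done" to anyone else after this
function has returned 0.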

An "fprintf" library call returns as soon as the change is in the
library's own I/O buffers, inside your process.
If you "fflush" things, that will cause a write(),
but the "write()" system call returns as soon as the change is in
the operating system buffers, aka the "page cache".

If you do not use fflush, the library will not even pass the
changes on to the operating system until its buffer fills up or
the file is closed (and "kill -9" gives it no chance to do that).
If you do not use fsync or equivalents, the operating system
will write out the changes ("dirty pages") at its leisure,
in any order, whenever it "feels" like it.
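
The layers described above, as a short C sketch for a service
that uses stdio. The function name save_value is made up for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: rewrite `path` with `value`, pushing the change
 * through each buffering layer in turn. Returns 0 on success, -1 on error. */
int save_value(const char *path, long value)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    /* Layer 1: fprintf() only fills the stdio buffer inside this process. */
    if (fprintf(fp, "%ld\n", value) < 0)
        goto fail;

    /* Layer 2: fflush() issues write(), which moves the data into the
     * kernel's page cache. */
    if (fflush(fp) != 0)
        goto fail;

    /* Layer 3: fsync() makes the kernel write those dirty pages to the
     * block device and waits for the result. */
    if (fsync(fileno(fp)) != 0)
        goto fail;

    return fclose(fp) == 0 ? 0 : -1;

fail:
    fclose(fp);
    return -1;
}
```

Note that opening with "w" truncates the file first, so a crash
between fopen() and fsync() can still leave an empty file behind.
The sketch only shows which call reaches which layer.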

All of which means DRBD does not even see those changes
until the actual write-out to the block device happens.

If you need your changes to hit "stable storage" in a specific
order, you need coordination (that usually means locks) between
different threads or processes of your service, and you need fsync
at the right places.
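
A rough C sketch of "locks plus fsync at the right places". The
two-file layout (a log that is appended to, and a counter file)
and the names set_lock and append_then_count are assumptions for
illustration only. The fcntl() lock used here coordinates
processes; threads inside one process would need a mutex instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Take (F_WRLCK) or release (F_UNLCK) an advisory lock on the whole file,
 * waiting if another process currently holds it. */
static int set_lock(int fd, short type)
{
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;   /* l_start = 0 and l_len = 0: whole file */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Hypothetical update: the log record must be on stable storage before
 * the counter changes. Returns 0 on success, -1 on error. */
int append_then_count(const char *log_path, const char *counter_path,
                      const char *record, long new_count)
{
    char buf[32];
    int rc = -1;
    size_t rlen = strlen(record);
    int len = snprintf(buf, sizeof buf, "%20ld\n", new_count);
    int log_fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    int cnt_fd = open(counter_path, O_WRONLY | O_CREAT, 0644);

    if (log_fd < 0 || cnt_fd < 0 || len < 0 || (size_t)len >= sizeof buf)
        goto out;
    if (set_lock(cnt_fd, F_WRLCK) != 0)
        goto out;

    /* Step 1: append the record and force it out before going on. */
    if (write(log_fd, record, rlen) == (ssize_t)rlen && fsync(log_fd) == 0) {
        /* Step 2: only now overwrite the counter, and force that out too. */
        if (write(cnt_fd, buf, (size_t)len) == (ssize_t)len &&
            fsync(cnt_fd) == 0)
            rc = 0;
    }
    set_lock(cnt_fd, F_UNLCK);
out:
    if (log_fd >= 0)
        close(log_fd);
    if (cnt_fd >= 0)
        close(cnt_fd);
    return rc;
}
```

After a crash, the log may then be one record ahead of the
counter, but the counter should never be ahead of the log.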

If you need "transactions", and you need to be able to "recover"
from a partial change, then maybe you want to use an actual
database -- they solved all these problems for you already.

    Lars