[DRBD-user] massive latency increases from the slave with barrier or flush enabled

Phil Stoneman pws at corefiling.com
Fri Jun 24 15:27:54 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On 23/06/11 10:49, Phil Stoneman wrote:
> I need a bit of help understanding an issue we're seeing with our DRBD
> setup. We have two Debian systems, with DRBD 8.3.8. We have a resource
> configured with protocol A, which backs onto a single SATA disk on each
> side.
>
> I am running the latency test from the DRBD documentation, multiple times:
> while true; do dd if=/dev/zero of=$TEST_DEVICE bs=512 count=1000 \
>   oflag=direct; sync; done
>
> If I stop drbd and run it against the backing device on each side, it
> repeats again and again very quickly. If I start DRBD and run it against
> the DRBD device on the primary side, it runs quickly for about 4 or 5
> repetitions, then slows right down. Investigation shows pe: is climbing
> in /proc/drbd, and iostat -mxd 1 on the secondary node shows 100% disk
> usage for the backing device. Note that this repeats in the other
> direction if I swap primary/secondary roles. It's only the secondary
> role that's seeing 100% disk usage, not the primary.
>
> When I use no-disk-barrier and no-disk-flushes, the problem goes away
> entirely - but I'm reluctant to enable this permanently, as they're just
> normal SATA drives without any battery backup or anything, and there are
> scary warnings about doing that in the documentation :-)

After a bunch more testing, it looks like DRBD on the secondary side 
(but not on the primary side) is either not using the write cache of 
the underlying storage or regularly flushing it. I can actually see a 
reference to this in the manual[1]: "DRBD uses disk flushes for write 
operations both to its replicated data set and to its meta data."

I must be honest: I don't completely understand the rationale for 
using the write cache on the primary side but not on the secondary 
side - it really hurts performance in some use cases!

Still, now that I know what's going on, I'm a little more comfortable 
using no-disk-barrier and no-disk-flushes. It means that I might lose 
data written into the drive's write cache, but that's no worse a 
situation than a normal system using the hard drives natively.
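
In case it helps anyone else, here's roughly what that looks like in 
the resource configuration - a minimal sketch assuming the 8.3-style 
syntax, with the resource name and the remaining settings as 
placeholders:

  resource r0 {
    protocol A;
    disk {
      no-disk-barrier;   # skip write barriers on the backing device
      no-disk-flushes;   # skip disk flushes; data still in the write
                         # cache can be lost on power failure
    }
    # ... on <host> { device / disk / address / meta-disk } sections
    # unchanged ...
  }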


I'm still interested to hear why DRBD works that way, though...

Phil

[1] http://www.drbd.org/users-guide/s-disk-flush-support.html


