[DRBD-user] Semantics of oos value, verification abortion

Christoph Lechleitner christoph.lechleitner at iteg.at
Thu Dec 28 21:14:37 CET 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On 2017-12-28 13:32, Veit Wahlich wrote:
> Hi Christoph, 
> 
> I do not have experience with the precise functioning of LXC disk storage, but I assume that every operation that can cause oos can also be performed by applications running inside LXC containers.
> 
> A common cause, and the one I suspect here, is opening a file (or block device) with O_DIRECT. This flag is used to reduce I/O latency, in particular by bypassing the page cache, but it also allows buffers to be modified in flight while they are being processed by e.g. DRBD. So not only is DRBD affected by this, but also software RAID such as mdraid, dmraid or lvmraid, and I bet even block caching such as bcache.
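
[Editorial aside, not part of Veit's message: a minimal C sketch of the race described above. With O_DIRECT the kernel transfers data straight from the application's buffer, so if another thread rewrites that buffer while the write is still in flight, the local disk and a replication target (e.g. a DRBD peer) can each pick up different contents. "testfile" is a placeholder path on a filesystem that supports O_DIRECT; this is an illustration, not a reproduction of any particular workload.]

/* Illustrative sketch only: keep modifying a buffer while an O_DIRECT
 * write using that buffer is in flight.  Which contents end up on the
 * local disk and on a replication peer depends on when each transfer
 * actually reads the buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 4096                 /* O_DIRECT needs aligned buffer/size/offset */

static char *buf;
static volatile int running = 1;   /* simplified stop flag for this sketch */

/* Keep rewriting the buffer while write(2) below may still be using it. */
static void *scribbler(void *arg)
{
    (void)arg;
    while (running)
        memset(buf, rand() & 0xff, BUFSZ);
    return NULL;
}

int main(void)
{
    if (posix_memalign((void **)&buf, BUFSZ, BUFSZ))
        return 1;
    memset(buf, 'A', BUFSZ);

    /* "testfile" is a placeholder; O_DIRECT bypasses the page cache. */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return 1;

    pthread_t t;
    pthread_create(&t, NULL, scribbler, NULL);

    /* The block layer reads directly from buf; the scribbler may change
     * it mid-transfer, which is the in-flight modification in question. */
    ssize_t ret = write(fd, buf, BUFSZ);

    running = 0;
    pthread_join(t, NULL);
    close(fd);
    free(buf);
    return ret == BUFSZ ? 0 : 1;
}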

Are you serious?

Can someone from linbit please comment on this?

This would basically mean that DRBD is useless whenever an application
opens files with O_DIRECT!?

How could a fast path to user space render the replication of the
underlying block device useless?


> In most cases O_DIRECT is used by applications such as some DBMS that want to avoid caching by the kernel, either because they implement their own cache or because they do not want the kernel to sacrifice memory on page-caching data that will not be read again.
> 
> So my recommendation is to check your logs/monitoring to see whether the oos has occurred repeatedly only on certain containers, and if so, to inspect the configuration of the applications running inside them for use of O_DIRECT (which can usually be disabled).
> If it has been occurring on all your containers, I would instead suspect your LXC configuration itself as the cause, such as an overlay filesystem or the container image.
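
[Editorial aside, not part of the original message: on a standard Linux host the open(2) flags of every file descriptor are visible, in octal, in the "flags:" line of /proc/<pid>/fdinfo/<fd>, so containers could in principle be scanned for O_DIRECT users instead of auditing each application's configuration by hand. A minimal C sketch that checks a single pid/fd pair follows; the pid, fd and /proc layout are assumptions about a typical Linux system.]

/* Illustrative helper: report whether a given fd of a given pid was opened
 * with O_DIRECT by parsing /proc/<pid>/fdinfo/<fd>.  Assumes a standard
 * Linux /proc layout; O_DIRECT's numeric value is architecture specific,
 * so compile this on the same architecture it inspects. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <fd>\n", argv[0]);
        return 2;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/fdinfo/%s", argv[1], argv[2]);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 2;
    }

    /* The "flags:" line shows the descriptor's open(2) flags in octal. */
    char line[256];
    unsigned long flags = 0;
    int found = 0;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "flags: %lo", &flags) == 1) {
            found = 1;
            break;
        }
    }
    fclose(f);

    if (!found) {
        fprintf(stderr, "no flags line in %s\n", path);
        return 2;
    }

    printf("%s: flags=0%lo O_DIRECT=%s\n", path, flags,
           (flags & O_DIRECT) ? "yes" : "no");
    return (flags & O_DIRECT) ? 0 : 1;
}

[Looping such a check over /proc/*/fd inside each container would flag O_DIRECT users without reading every application's configuration.]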

Checking 1000s of applications in 100s of containers is NOT an option.


Regards, Christoph


