[DRBD-user] KAISER healing? Re: Semantics of oos value, verification abortion

Christoph Lechleitner christoph.lechleitner at iteg.at
Sun Apr 1 19:41:32 CEST 2018


On 20.01.18 at 01:58, Veit Wahlich wrote:
> Hi Christoph, 
> 
> this might also be caused by other patches backported to make the KPTI patches work with your running kernel. 

After some observation, neither the KAISER patches nor any other kernel changes have made any difference.

After some uptime (about 2 months) the PROBLEM REAPPEARED.

The number of out-of-sync blocks varies, and there are resources that don't usually suffer from the problem.
But there are way too many oos blocks on way too many resources (on rather bored, idle nodes) to live with the situation long term.

The intensity fluctuates, probably depending on all kinds of parameters (power of the hardware, load, protocol version, dynamic vs. fixed sync rate, i.e. timing and the weather), but it won't vanish.

It is definitely not an O_DIRECT problem though.
With the help of an actual kernel developer we used several techniques, up to a custom kernel module and function hooks, to trace O_DIRECT operations.
The only process that uses O_DIRECT to write to disk is our Oracle test machine.

Either there is something new along the lines of O_DIRECT, or the hardware/driver/network combination is that unreliable (unlikely), or whatever.
 
Unfortunately we are out of ideas. Mid-term we'll have to switch to something else, and sadly it feels like following the industry away from DRBD ;-(

Regards,

Christoph

