[DRBD-user] KAISER healing? Re: Semantics of oos value, verification abortion

Veit Wahlich cru.lists at zodia.de
Sat Jan 20 01:58:06 CET 2018


Hi Christoph, 

this might also be caused by other patches backported to make the KPTI patches work with your running kernel. 

But I do think it is possible that the KPTI patches merely sidestep your problem, e.g. because syscalls now block longer and keep your application from manipulating buffers. Your kernel (or tunables profile) might also come with changed default values for process scheduling, e.g. giving longer time slices to (kernel) processes or lowering the CPU migration rate to minimise the number of context switches and so reduce the impact of the Meltdown/Spectre patches. That would simply raise the probability that the data is fully processed before another process that might change it in flight gets scheduled. I can think of a few more scenarios. 
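
To illustrate the kind of race I mean, here is a toy model in Python. It is only a sketch and not DRBD's actual code path: the names send_block, receiver_accepts and append_log_line are made up for this example, and SHA-1 merely stands in for whatever data-integrity-alg is configured. The sender checksums a buffer, an "upper layer" may touch the buffer before the bytes go out, and the receiver recomputes the checksum.

```python
import hashlib


def send_block(buf, mutate_in_flight=None):
    """Checksum a buffer, optionally let someone modify it, then 'transmit' it.

    Hypothetical helper: models the window between checksum calculation and
    the moment the bytes are actually copied to the wire.
    """
    digest = hashlib.sha1(bytes(buf)).digest()  # checksum taken at submit time
    if mutate_in_flight is not None:
        mutate_in_flight(buf)                   # upper layer touches the buffer
    payload = bytes(buf)                        # bytes that really get sent
    return digest, payload


def receiver_accepts(digest, payload):
    """Receiver side: recompute the checksum and compare."""
    return hashlib.sha1(payload).digest() == digest


def append_log_line(buf):
    """Simulate an application appending to the same page while it is in flight."""
    line = b"log line 2\n"
    buf[11:11 + len(line)] = line               # same-length slice, size unchanged


# A 64-byte "page" that already holds one log line, padded with zero bytes.
page = bytearray(b"log line 1\n".ljust(64, b"\0"))

ok = receiver_accepts(*send_block(page))                      # nobody interferes
raced = receiver_accepts(*send_block(page, append_log_line))  # modified in flight

print("undisturbed write accepted:", ok)
print("write modified in flight accepted:", raced)
```

If the scheduler happens to let the write finish before the appending process runs again, the second case behaves like the first. That is why changed timing alone could make the symptom disappear without fixing anything.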

Best regards, 
// Veit 


-------- Original Message --------
From: Christoph Lechleitner <christoph.lechleitner at iteg.at>
Sent: 20 January 2018 01:29:38 CET
To: drbd-user at lists.linbit.com
CC: Wolfgang Glas <wolfgang.glas at iteg.at>
Subject: [DRBD-user] KAISER healing? Re:  Semantics of oos value, verification abortion

On 31.12.17 at 17:21, Christoph Lechleitner wrote:
> On 2017-12-29 23:07, Christoph Lechleitner wrote:
>> On 2017-12-29 00:49, Christoph Lechleitner wrote:
>>
>> What's worse and slightly ALARMING:
>> When this occurs, a verification that happens to be running is aborted!
> 
> Maybe it's not that alarming.
> 
> After I failed to find anything in the logfiles of an LXC guest that is
> almost unused but regularly affected, I found out that those
>   "buffer modified by upper layers during write"
> might be false positives in swap or append situations.
> 
> While we don't use DRBD for swap, appending to logfiles happens all the
> time.
> 
> I'll follow the recommendation there:
> - switch off data-integrity-alg
> - keep up the regular verify runs

This "works" in the sense that the verify runs now seem to finish (as
opposed to aborting).

Although I wouldn't call it "working" as long as certain LXC guest
partitions continue to produce oos-Blocks on a daily basis.

We have already set up an appointment with a kernel expert to find
O_DIRECT "sinners" (by means of the kernel's features for function call
tracing).


BUT, the KAISER/KPTI patches seem to prevent further oos-Blocks.

Or maybe the inconsistencies go undetected now, maybe even hitting the
primary node ;-)

Next update in a few weeks ...


Regards, Christoph



