[DRBD-user] Semantics of oos value, verification abortion

Christoph Lechleitner christoph.lechleitner at iteg.at
Fri Dec 29 23:07:45 CET 2017



On 2017-12-29 00:49, Christoph Lechleitner wrote:
> On 2017-12-29 00:30, Simon Ironside wrote:
>>
>> On 28/12/17 20:14, Christoph Lechleitner wrote:
>>
>>> Are you serious?
>>>
>>> Can someone from linbit please comment on this?
>>
>> This has come up a few times. From the last episode I remember, here is
>> Lars Ellenberg's (LINBIT) response:
>>
>> http://lists.linbit.com/pipermail/drbd-user/2014-February/020534.html
> 
> Thanks!
> 
> That confirms Veit's assessment and explains
> https://pve.proxmox.com/wiki/DRBD#WARNINGS
> 
> I'll go O_DIRECT hunting then ...

A little update for your entertainment and a potentially alarming
finding to start the new year with something to do ...


With commands like ...

  lsof -l -n -P +fg / /var/lib/lxc/*/rootfs |grep "W,"

  lsof -l -n -P -FcptnG +fg / /var/lib/lxc/*/rootfs |egrep -B 4 "^G"

  lsof -l -n -P -FG +fg / /var/lib/lxc/*/rootfs \
    |egrep "^G" |cut -c2- |sort |uniq

... I could not find any files opened with O_DIRECT (flag "DIR").

All flags I found so far are (lsof output, meaning):
  AP = append
  LG = large file support
  ND = no delay (only postfix processes)
  NFLK = no follow links
  RW = Read write
  W = Write
  0x80000 = close-on-exec

0x80000 = 02000000 (octal) = O_CLOEXEC in /usr/include/asm-generic/fcntl.h

Direct I/O (O_DIRECT) should show up as DIR, but don't confuse that with
the DIR (DIRectory) entry in lsof's TYPE column.
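
As a cross-check that does not depend on lsof's flag names, the same
information can be read from /proc/<pid>/fdinfo/<fd>, where the kernel
prints the open flags as an octal number. The following is only a rough
sketch: the octal constants are taken from asm-generic/fcntl.h and differ
on some architectures (ARM, for example), and the function names has_flag
and list_o_direct_fds are made up for illustration.

```shell
#!/bin/sh
# Octal values as in asm-generic/fcntl.h (assumption: x86 or another
# architecture that uses the generic values).
O_DIRECT=040000
O_CLOEXEC=02000000   # 02000000 octal is 0x80000 in hex

# has_flag FLAGS MASK: succeed if octal FLAGS has all bits of octal MASK set.
# Both arguments need their leading 0 so the shell parses them as octal.
has_flag() {
    [ $(( $1 & $2 )) -eq $(( $2 )) ]
}

# list_o_direct_fds [PROC_ROOT]: print "pid fd target" for every file
# descriptor whose fdinfo flags contain O_DIRECT. Runs in a subshell so
# its variables do not leak into the caller.
list_o_direct_fds() (
    root=${1:-/proc}
    for info in "$root"/[0-9]*/fdinfo/*; do
        [ -r "$info" ] || continue
        flags=$(awk '$1 == "flags:" {print $2}' "$info" 2>/dev/null)
        [ -n "$flags" ] || continue
        if has_flag "$flags" "$O_DIRECT"; then
            fd=${info##*/}
            pid=${info%/fdinfo/*}
            pid=${pid##*/}
            printf '%s %s %s\n' "$pid" "$fd" \
                "$(readlink "$root/$pid/fd/$fd" 2>/dev/null)"
        fi
    done
)
```

Calling list_o_direct_fds without arguments as root on the host scans all
processes. Like lsof, it only catches descriptors that are open at the
moment it runs, so short-lived O_DIRECT users can still slip through.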


But either there are O_DIRECT applications I haven't found yet, or
something similar is going on, because this keeps getting syslogged:
  buffer modified by upper layers during write

What's worse and slightly ALARMING:
When this occurs, a verification that is in progress gets aborted!

This makes it near-impossible to find and fix oos-blocks on large
resources ;-(

Here is a (slightly anonymized) syslog excerpt for a resource that had
been verifying since 06:00:01:

Dec 29 06:50:54 node1 kernel: [35444774.750910] block drbd5: Digest
mismatch, buffer modified by upper layers during write: 176232240s +28672
Dec 29 06:50:54 node1 kernel: [35444774.772553] block drbd5: Online
Verify reached sector 1037581048
Dec 29 06:50:54 node1 kernel: [35444774.772581] drbd resource5: short
read (expected size 16)
Dec 29 06:50:54 node1 kernel: [35444774.772603] drbd resource5:
Terminating drbd_a_resource5
Dec 29 06:50:54 node1 kernel: [35444774.799453] drbd resource5:
Connection closed
Dec 29 06:50:54 node1 kernel: [35444774.799644] drbd resource5: receiver
terminated
Dec 29 06:50:54 node1 kernel: [35444774.799649] drbd resource5: receiver
(re)started
Dec 29 06:50:55 node1 kernel: [35444775.350731] block drbd5: peer(
Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk(
DUnknown -> Consistent )
Dec 29 06:50:55 node1 kernel: [35444775.369688] block drbd5: helper
command: /sbin/drbdadm before-resync-source minor-5
Dec 29 06:50:55 node1 kernel: [35444775.371732] block drbd5: conn(
WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Dec 29 06:50:55 node1 kernel: [35444775.371835] block drbd5: updated
sync UUID
01EE2BD9316E4E41:46C59A68FA436461:46C49A68FA436461:D6DE89088E93A90B
Dec 29 06:51:01 node1 kernel: [35444781.210333] block drbd6: conn(
Connected -> VerifyS )
Dec 29 06:51:01 node1 kernel: [35444781.210370] block drbd6: Starting
Online Verify from sector 0

I'm sorry for Thunderbird breaking lines, but HTML mails would be worse
in a mailing list, wouldn't they?


Intermediate conclusion:

I will try to find the causes of "buffer modified by upper layers during
write" by comparing the host's syslog with the guest systems' syslogs and
any relevant application logs, which will take time.
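
To get the host-side timestamps for that comparison, something like the
following could pull the mismatch events out of the kernel log. It is only
a sketch: the function name digest_mismatches is made up, and the pattern
assumes the traditional syslog format shown in the excerpt above, with each
entry on a single line (the excerpt is wrapped only because of the mail
client).

```shell
#!/bin/sh
# digest_mismatches: read kernel log lines on stdin and print
# "timestamp device first-number second-number" for every
# "Digest mismatch, buffer modified by upper layers" entry.
# The first number is the value before the "s" (presumably the start
# sector), the second the value after the "+" (presumably the length).
digest_mismatches() {
    sed -n -E 's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}) .* block (drbd[0-9]+): Digest mismatch, buffer modified by upper layers during write: ([0-9]+)s \+([0-9]+).*$/\1 \2 \3 \4/p'
}
```

Usage would be along the lines of "digest_mismatches < /var/log/syslog",
where the log path is an assumption and depends on the distribution.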

But I'd be interested to hear whether others are seeing "upper layers"
messages and verification aborts too.


Regards, Christoph


