On 2017-12-29 00:49, Christoph Lechleitner wrote:
> On 2017-12-29 00:30, Simon Ironside wrote:
>>
>> On 28/12/17 20:14, Christoph Lechleitner wrote:
>>
>>> Are you serious?
>>>
>>> Can someone from linbit please comment on this?
>>
>> This has come up a few times. From the last episode I remember, here is
>> Lars Ellenberg's (LINBIT) response:
>>
>> http://lists.linbit.com/pipermail/drbd-user/2014-February/020534.html
>
> Thanks!
>
> That confirms Veit's assessment and explains
> https://pve.proxmox.com/wiki/DRBD#WARNINGS
>
> I'll go O_DIRECT hunting then ...

A little update for your entertainment, and a potentially alarming
finding to start the new year with something to do ...

With commands like ...

lsof -l -n -P +fg / /var/lib/lxc/*/rootfs |grep "W,"
lsof -l -n -P -FcptnG +fg / /var/lib/lxc/*/rootfs |egrep -B 4 "^G"
lsof -l -n -P -FG +fg / /var/lib/lxc/*/rootfs \
  |egrep "^G" |cut -c2- |sort |uniq

... I could not find any files opened with O_DIRECT (flag "DIR").

All flags I have found so far are (lsof output = meaning):

AP      = append
LG      = large file support
ND      = no delay (only postfix processes)
NFLK    = no follow links
RW      = read/write
W       = write
0x80000 = close-on-exec

0x80000 = 02000000 = O_CLOEXEC in /usr/include/asm-generic/fcntl.h

Direct I/O (O_DIRECT) should show up as DIR, but don't confuse that
with DIR for DIRectory in lsof's TYPE column.

Still, either there are O_DIRECT applications I haven't found yet, or
something similar is happening, because syslog reports

  buffer modified by upper layers during write

What's worse and slightly ALARMING: when this occurs, any online
verify that is running gets aborted!
This makes it near-impossible to find and fix out-of-sync (oos) blocks
on large resources ;-(

Here is a (slightly pseudo-anonymized) syslog excerpt regarding a
resource that had been verifying since 6:00:01:

Dec 29 06:50:54 node1 kernel: [35444774.750910] block drbd5: Digest mismatch, buffer modified by upper layers during write: 176232240s +28672
Dec 29 06:50:54 node1 kernel: [35444774.772553] block drbd5: Online Verify reached sector 1037581048
Dec 29 06:50:54 node1 kernel: [35444774.772581] drbd resource5: short read (expected size 16)
Dec 29 06:50:54 node1 kernel: [35444774.772603] drbd resource5: Terminating drbd_a_resource5
Dec 29 06:50:54 node1 kernel: [35444774.799453] drbd resource5: Connection closed
Dec 29 06:50:54 node1 kernel: [35444774.799644] drbd resource5: receiver terminated
Dec 29 06:50:54 node1 kernel: [35444774.799649] drbd resource5: receiver (re)started
Dec 29 06:50:55 node1 kernel: [35444775.350731] block drbd5: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Dec 29 06:50:55 node1 kernel: [35444775.369688] block drbd5: helper command: /sbin/drbdadm before-resync-source minor-5
Dec 29 06:50:55 node1 kernel: [35444775.371732] block drbd5: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Dec 29 06:50:55 node1 kernel: [35444775.371835] block drbd5: updated sync UUID 01EE2BD9316E4E41:46C59A68FA436461:46C49A68FA436461:D6DE89088E93A90B
Dec 29 06:51:01 node1 kernel: [35444781.210333] block drbd6: conn( Connected -> VerifyS )
Dec 29 06:51:01 node1 kernel: [35444781.210370] block drbd6: Starting Online Verify from sector 0

I'm sorry for Thunderbird breaking lines, but HTML mails would be worse
on a mailing list, wouldn't they?

Intermediate conclusion:
I will try to find the causes of "buffer modified by upper layers
during write" by comparing the host syslog with the guest systems'
syslogs and any application logs, which will take time.

I'd be interested to hear whether others are seeing "upper layers"
messages and verification aborts too.

Regards, Christoph
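
To compare the host syslog with guest logs, it may help to first pull the "Digest mismatch" lines out of the host syslog along with their timestamps and sector numbers. The following is only a minimal sketch. The names MISMATCH_RE, Mismatch, find_mismatches and scan_file are made up for illustration, the pattern is modelled solely on the excerpt above (other DRBD or syslog versions may format the line differently), and reading the "+28672" field as a length in bytes is an assumption.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern modelled on the single syslog line quoted in this mail (assumption:
# classic "Mon DD HH:MM:SS host kernel:" syslog prefix, message text unchanged).
MISMATCH_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:.*?"
    r"block (?P<device>drbd\d+): Digest mismatch, "
    r"buffer modified by upper layers during write: "
    r"(?P<sector>\d+)s \+(?P<length>\d+)"
)


class Mismatch(NamedTuple):
    stamp: str    # syslog timestamp, e.g. "Dec 29 06:50:54"
    host: str     # reporting node
    device: str   # DRBD minor device, e.g. "drbd5"
    sector: int   # start sector of the affected write
    length: int   # value after "+", assumed to be a size in bytes


def find_mismatches(lines: Iterable[str]) -> Iterator[Mismatch]:
    """Yield one Mismatch per matching log line; skip everything else."""
    for line in lines:
        m = MISMATCH_RE.search(line)
        if m:
            yield Mismatch(
                m["stamp"], m["host"], m["device"],
                int(m["sector"]), int(m["length"]),
            )


def scan_file(path: str) -> list:
    """Collect all mismatches from a log file (path is supplied by the caller)."""
    with open(path, errors="replace") as fh:
        return list(find_mismatches(fh))
```

Calling scan_file() on the host's syslog file (wherever it lives on the system in question) gives a list of timestamps and sectors that can then be lined up by hand against the guest and application logs.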