>> Most of the time (99%) I see ERR for the swap space of virtual machines.
>
> If you enable "integrity-alg", do you still see those "buffer modified
> by upper layers during write"?
>
> Well, then that is your problem,
> and that problem can *NOT* be fixed with DRBD "config tuning".
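
For context, the option being referred to is "data-integrity-alg" in the
resource's net section (DRBD 8.x syntax; the resource name and digest below
are only examples, adjust to your setup):

    resource r0 {
        net {
            # checksum every data block sent over the wire; any digest
            # the kernel crypto API offers (md5, sha1, crc32c, ...) works
            data-integrity-alg md5;
        }
        # ... device, disk, and host sections as usual ...
    }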
>
> What does that mean?
>
> Upper layer submits write to DRBD.
> DRBD calculates checksum over data buffer.
> DRBD sends that checksum.
> DRBD submits data buffer to "local" backend block device.
> Meanwhile, upper layer changes data buffer.
> DRBD sends data buffer to peer.
> DRBD receives local completion.
> DRBD receives remote ACK.
> DRBD completes this write to upper layer.
> *only now* would the upper layer be "allowed"
> to change that data buffer again.
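
A crude userspace analogy of that race, in shell (this is not DRBD code, it
only mimics "checksum taken, then the buffer keeps changing underneath"):

    buf=$(mktemp)                                   # stands in for the in-flight write buffer
    dd if=/dev/urandom of="$buf" bs=4k count=1 2>/dev/null
    sum_wire=$(md5sum "$buf" | awk '{print $1}')    # checksum DRBD would put on the wire
    # "upper layer" rewrites the buffer while the write is still in flight
    dd if=/dev/zero of="$buf" bs=4k count=1 conv=notrunc 2>/dev/null
    sum_sent=$(md5sum "$buf" | awk '{print $1}')    # checksum of the data actually shipped
    [ "$sum_wire" = "$sum_sent" ] || echo "mismatch -- what integrity-alg complains about"
    rm -f "$buf"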
>
> Misbehaving upper layer results in potentially divergent blocks
> on the DRBD peers. Or would result in potentially divergent blocks on
> a local software RAID 1. Which is why the mdadm maintenance script
> in rhel, "raid-check", intended to be run periodically from cron,
> has this tell-tale chunk:
>
>     mismatch_cnt=`cat /sys/block/$dev/md/mismatch_cnt`
>     # Due to the fact that raid1/10 writes in the kernel are unbuffered,
>     # a raid1 array can have non-0 mismatch counts even when the
>     # array is healthy. These non-0 counts will only exist in
>     # transient data areas where they don't pose a problem. However,
>     # since we can't tell the difference between a non-0 count that
>     # is just in transient data or a non-0 count that signifies a
>     # real problem, simply don't check the mismatch_cnt on raid1
>     # devices as it's providing far too many false positives. But by
>     # leaving the raid1 device in the check list and performing the
>     # check, we still catch and correct any bad sectors there might
>     # be in the device.
>     raid_lvl=`cat /sys/block/$dev/md/level`
>     if [ "$raid_lvl" = "raid1" -o "$raid_lvl" = "raid10" ]; then
>         continue
>     fi
>
> Anyways.
> Point being: Either have those upper layers stop modifying buffers
> while they are in-flight (keyword: "stable pages").
> Kernel upgrade within the VMs may do it. Changing something in the
> "virtual IO path configuration" may do it. Or not.
>
> Or live with the results, which are
> potentially not identical blocks on the DRBD peers.
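
To at least measure how divergent the peers are, DRBD's online verify can
count the out-of-sync blocks; it needs "verify-alg" set in the net section,
then (resource name again a placeholder):

    drbdadm verify r0    # compares blocks between the peers in the background
    # progress and the resulting out-of-sync count show up in /proc/drbd
    # ("oos:") and in the kernel log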
<span class=""><font color="#888888"></font></span></blockquote><div><br></div><div>Hello Lars,<br><br></div><div>Thank you for the detailed explanation. I've done some more tests and found that "out of sync" sectors appear for master-slave also, not only for master-master.<br>
<br>Can you share your thoughts about what can cause upper layer changes in the following schema?<br></div><div>KVM (usually virtio) -> LVM -> DRBD -> RAID10 -> Physical drives, while LVM snapshots are not used.<br>
<br></div><div>Can LVM cause these OOS? Could it help if we replace by the following schema?<br>KVM (usually virtio) -> DRBD -> LVM -> RAID10 -> Physical drives, while LVM snapshots are not used.<br><br></div>
<div>Stanislav<br></div></div></div></div>