Hello,<br><br>I did some further investigation.<br><br>1. All hardware firmware is now up to date, but nothing has changed. All TCP offload features are disabled on all 4 Ethernet controllers.<br>2. I have created a small script for comparing out-of-sync blocks:<br>
------------------------------------------------------------------------<br>#!/bin/bash<br><br># Reads kernel log lines on stdin, for example:<br>#echo 'Mar 31 10:24:04 virt1 kernel: block drbd0: Out of sync: start=1036171232, size=8 (sectors)' <br>while read -r line; do<br>
 if [[ $line =~ Out\ of\ sync:\ start=([0-9]+),\ size=([0-9]+) ]]; then<br> start=${BASH_REMATCH[1]}<br> size=${BASH_REMATCH[2]}<br> echo "$start - $size"<br> # Checksum the same sector range on both nodes; "< /dev/null" keeps ssh from eating the loop's stdin<br> sum1=$(ssh 10.1.2.1 dd iflag=direct if=/dev/drbd0 bs=512 skip="$start" count="$size" 2>/dev/null < /dev/null | md5sum | awk '{print $1}')<br>
 sum2=$(ssh 10.1.2.2 dd iflag=direct if=/dev/drbd0 bs=512 skip="$start" count="$size" 2>/dev/null < /dev/null | md5sum | awk '{print $1}')<br> # Note: if both ssh/dd calls fail, both sums are the md5 of empty input and compare equal<br> if [[ $sum1 = "$sum2" ]]; then<br> echo "OK: $sum1 - $sum2"<br>
 else<br> echo "ERR: $sum1 - $sum2"<br> # Keep a copy of the differing range from each node for later inspection<br> ssh 10.1.2.1 dd iflag=direct if=/dev/drbd0 bs=512 skip="$start" count="$size" 2>/dev/null < /dev/null > "/tmp/${start}_${size}_1"<br>
 ssh 10.1.2.2 dd iflag=direct if=/dev/drbd0 bs=512 skip="$start" count="$size" 2>/dev/null < /dev/null > "/tmp/${start}_${size}_2"<br> fi<br> fi<br>done<br>------------------------------------------------------------------------<br>
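As a side note on the log format the script parses: the start and size values are sector counts, so converting them to byte offsets can help when relating a block to a partition or file offset. This is only a minimal sketch, assuming 512-byte sectors (as the script's bs=512 does); the function name oos_to_bytes is hypothetical and not part of the script above.

```bash
#!/bin/bash
# Hypothetical helper: turn one DRBD "Out of sync" kernel log line into
# "start_sector size_sectors byte_offset byte_length".
# Assumption: sectors are 512 bytes, matching bs=512 in the script above.
oos_to_bytes() {
    # Pull "start size" out of the log line with sed; empty if no match.
    fields=$(printf '%s\n' "$1" |
        sed -n 's/.*Out of sync: start=\([0-9][0-9]*\), size=\([0-9][0-9]*\).*/\1 \2/p')
    [ -n "$fields" ] || return 1    # not an out-of-sync line
    start=${fields% *}
    size=${fields#* }
    echo "$start $size $((start * 512)) $((size * 512))"
}

oos_to_bytes 'Mar 31 10:24:04 virt1 kernel: block drbd0: Out of sync: start=1036171232, size=8 (sectors)'
# prints: 1036171232 8 530519670784 4096
```

For the sample line, size=8 sectors works out to 4096 bytes, i.e. a single 4 KiB block.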
The comparison found only a couple of matching blocks; most of them differ.<br>3. Today's out-of-sync blocks belong to VM number 109. I did the following:<br>- turned off this VM<br>- copied the logical volume to a file:<br>dd if=/dev/drbd-lvm-0/vm-109-disk-1 of=/tmp/vm-109-disk-1 bs=1M<br>
- copied the logical volume back from the file:<br>dd if=/tmp/vm-109-disk-1 of=/dev/drbd-lvm-0/vm-109-disk-1 bs=1M<br>4. Ran the comparison script again, and it shows that all blocks now match.<br>(That is very good, because I don't need to stop either of the dual-master nodes and don't risk syncing in the wrong direction. In the worst case, if both nodes have VMs with out-of-sync blocks, I couldn't even do that without losing data.)<br>
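The copy-out and copy-back from step 3 presumably works because every block written back goes through DRBD and is replicated to the peer. A minimal sketch of the same round trip with a checksum check before writing back is below; the function name rewrite_in_place and its arguments are hypothetical, and the VM using the volume is assumed to be shut down, as in step 3.

```bash
#!/bin/bash
# Hypothetical helper: copy SRC to TMP, confirm the copy is identical,
# then write it back over SRC. For the case above, SRC would be
# /dev/drbd-lvm-0/vm-109-disk-1 and TMP /tmp/vm-109-disk-1.
rewrite_in_place() {
    src=$1
    tmp=$2
    dd if="$src" of="$tmp" bs=1M 2>/dev/null || return 1
    # Refuse to write back if the copy does not match the source.
    sum_src=$(md5sum < "$src" | awk '{print $1}')
    sum_tmp=$(md5sum < "$tmp" | awk '{print $1}')
    [ "$sum_src" = "$sum_tmp" ] || return 2
    # conv=notrunc only matters when SRC is a regular file (e.g. a dry run).
    dd if="$tmp" of="$src" bs=1M conv=notrunc 2>/dev/null
}
```

It can be tried on an ordinary file first, e.g. rewrite_in_place /tmp/testfile /tmp/testfile.copy, before pointing it at a logical volume.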
<br>Next step: I'll physically remove one link from my RR bonding setup, leaving only one connection, and then wait for new verify results.<br><br>Any ideas so far?<br><br>Regards,<br>Stanislav<br>