Hello,
I did some further investigation.
1. All hardware firmware is now up to date, but nothing has changed.
All TCP offload features are disabled on all 4 Ethernet controllers.
2. I have created a small script for comparing out-of-sync blocks:
------------------------------------------------------------------------
#!/bin/bash
# Reads kernel log lines on stdin and compares every reported out-of-sync
# range on both nodes. Example input line:
# Mar 31 10:24:04 virt1 kernel: block drbd0: Out of sync: start=1036171232, size=8 (sectors)
re='Out of sync: start=([0-9]+), size=([0-9]+)'

# Dump a range of 512-byte sectors of /dev/drbd0 from the given node.
# stdin is redirected so that ssh does not swallow the log lines.
read_range() {
    ssh "$1" dd iflag=direct if=/dev/drbd0 bs=512 skip="$2" count="$3" 2>/dev/null < /dev/null
}

while read -r line; do
    if [[ $line =~ $re ]]; then
        start=${BASH_REMATCH[1]}
        size=${BASH_REMATCH[2]}
        echo "$start - $size"
        sum1=$(read_range 10.1.2.1 "$start" "$size" | md5sum | awk '{print $1}')
        sum2=$(read_range 10.1.2.2 "$start" "$size" | md5sum | awk '{print $1}')
        if [[ $sum1 == "$sum2" ]]; then
            echo "OK: $sum1 - $sum2"
        else
            echo "ERR: $sum1 - $sum2"
            # Keep both versions of the range for later inspection.
            read_range 10.1.2.1 "$start" "$size" > "/tmp/${start}_${size}_1"
            read_range 10.1.2.2 "$start" "$size" > "/tmp/${start}_${size}_2"
        fi
    fi
done
------------------------------------------------------------------------
The comparison found only a couple of matching ranges and a lot of differing ones.
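To relate a reported range to a position on the device, the sector numbers can be converted to bytes. The following is a minimal sketch, not part of the script above: the function name oos_to_bytes is hypothetical, and it assumes the sectors in the log line are 512 bytes each, as the bs=512 in the dd calls implies.

```shell
# Hypothetical helper: turn one "Out of sync" kernel log line into a
# byte offset and a byte length, assuming 512-byte sectors.
oos_to_bytes() {
    # Extract "start size" from the line; prints nothing if it does not match.
    range=$(printf '%s\n' "$1" |
        sed -n 's/.*Out of sync: start=\([0-9][0-9]*\), size=\([0-9][0-9]*\).*/\1 \2/p')
    [ -n "$range" ] || return 1
    # Split the two numbers into $1 (start) and $2 (size).
    set -- $range
    echo "offset=$(($1 * 512)) length=$(($2 * 512))"
}

oos_to_bytes 'Mar 31 10:24:04 virt1 kernel: block drbd0: Out of sync: start=1036171232, size=8 (sectors)'
```

For the sample line, size=8 corresponds to 4096 bytes, that is, a single 4 KiB block.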
3. Today's out-of-sync blocks all belong to VM number 109. I did the
following:
- shut down this VM
- copied the logical volume to a file:
dd if=/dev/drbd-lvm-0/vm-109-disk-1 of=/tmp/vm-109-disk-1 bs=1M
- copied the logical volume back from the file:
dd if=/tmp/vm-109-disk-1 of=/dev/drbd-lvm-0/vm-109-disk-1 bs=1M
4. I ran the comparison script again, and it now shows that all blocks
match.
(That is very good, because I don't need to stop either of the dual-master
nodes and don't risk syncing in the wrong direction. In the worst case,
if both nodes have VMs with out-of-sync blocks, I couldn't even do that
without losing data.)
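The copy-out in step 3 could be checked before the image is written back, so that a bad read is not pushed onto both nodes. This is only a sketch under assumptions: the function name verify_copy is hypothetical, and it relies on the VM staying shut down so that the volume does not change between the copy and the check.

```shell
# Hypothetical helper (not part of the original procedure): report whether
# two files or block devices have identical content.
verify_copy() {
    if cmp -s "$1" "$2"; then
        echo "identical: $1 $2"
    else
        # cmp also fails if one of the paths cannot be read.
        echo "DIFFERENT: $1 $2"
        return 1
    fi
}

# Intended use with the paths from step 3, while the VM is still shut down:
# verify_copy /dev/drbd-lvm-0/vm-109-disk-1 /tmp/vm-109-disk-1
```

Since cmp reads both inputs in full, this costs one more pass over the volume.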
Next step: I'll physically remove one link from my round-robin (RR)
bond, leaving only a single connection, and then wait for the next
verify results.
Any ideas so far?
Regards,
Stanislav