[DRBD-user] drbd90 unexpected split-brain, possible uuid_compare issue
Jeremy Faith
jeremy.faith at jci.com
Mon Oct 19 16:59:26 CEST 2020
Hi,
drbd90 kernel module version: 9.0.22-2 (also 9.0.25-1 compiled from source)
drbd90-utils:9.12.2-1
kernel:3.10.0-1127.18.2.el7.x86_64
4 nodes
n1 primary
n2,n3,n4 all secondary
If I run the following script then sometimes, after the start, some of the nodes get stuck in Outdated or Inconsistent state forever. The loop generally completes correctly several times (the maximum was about 14) before getting stuck.
I have NOT been able to reproduce the split-brain state in this way, but I think it is related.
while true
do
    ssh n4 'service corosync stop'
    ssh n3 'service corosync stop'
    ssh n2 'service corosync stop'
    ssh n1 'service corosync stop'
    sleep 5
    ssh n1 'service pacemaker start'
    ssh n2 'service pacemaker start'
    ssh n3 'service pacemaker start'
    ssh n4 'service pacemaker start'
    #At this point r0 resource is mounted on /home
    #and processes are writing to it on n1
    while true
    do
        sleep 1
        echo "events2 `date`"
        drbdsetup events2 --now -c
        num_u2d=`drbdsetup events2 --now|grep -c disk:UpToDate`
        echo "num UpToDate=$num_u2d"
        [ "$num_u2d" = 4 ]&&break
        sleep 4
    done
done
If I change the stop order to
n1,n4,n3,n2
and the start order to
n2,n3,n4,n1
there is never a problem. I left this running over the weekend and it completed over 5 thousand iterations.
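To make the two orderings easier to compare, the stop and start steps could be parameterised. This is only a minimal sketch, assuming the same ssh access as the script above; `run_on_nodes` and the `REMOTE` variable are hypothetical names introduced here, not part of the original test.

```shell
#!/bin/sh
# Hypothetical helper: run one command on each node, in the order given.
# REMOTE defaults to ssh; override it (e.g. REMOTE=echo) for a dry run.
REMOTE=${REMOTE:-ssh}

run_on_nodes() {
    cmd=$1
    shift
    for node in "$@"
    do
        $REMOTE "$node" "$cmd" || echo "warning: '$cmd' failed on $node" >&2
    done
}

# Order that sometimes gets stuck:
#   run_on_nodes 'service corosync stop'   n4 n3 n2 n1
#   run_on_nodes 'service pacemaker start' n1 n2 n3 n4
# Order that never failed:
#   run_on_nodes 'service corosync stop'   n1 n4 n3 n2
#   run_on_nodes 'service pacemaker start' n2 n3 n4 n1
```

With `REMOTE=echo` the function only prints the node and command it would run, which is a cheap way to check an ordering before using it on the cluster.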
I reproduced the same issue using drbdadm commands directly, so this is not a corosync/pacemaker issue,
i.e. running the following script on n1 also produces the problem:-
while true
do
    for n in 4 3 2
    do
        ssh n$n 'drbdadm down r0'
    done
    #stop process that write to /home
    a2ksys stop >> a2ksys_stop.log 2>&1
    while [ -d /home/cem ]
    do
        umount /home
        [ ! -d /home/cem ]&&break
        echo lsof output
        lsof /home
        sleep 5
    done
    drbdadm secondary r0
    drbdadm down r0
    #END OF STOP
    drbdadm up r0
    while true
    do
        drbdadm primary r0&&break
        echo "sleep 5 before retry primary"
        sleep 5
    done
    while true
    do
        mount -orw /dev/drbd0 /home&&break
        echo "sleep 5 before retry mount"
        sleep 5
    done
    #start processes that do some writes HERE
    a2ksys start >> a2ksys_start.log 2>&1
    ssh n2 'drbdadm up r0'
    ssh n3 'drbdadm up r0'
    ssh n4 'drbdadm up r0'
    while true
    do
        sleep 1
        echo "events2 `date`"
        drbdsetup events2 --now -c
        num_u2d=`drbdsetup events2 --now|grep -c disk:UpToDate`
        echo "num UpToDate=$num_u2d"
        [ "$num_u2d" = 4 ]&&break
        sleep 4
    done
done
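Because the inner wait loop never exits when a node stays Outdated or Inconsistent, a bounded variant could make the stuck case detectable. The following is a sketch under assumptions: `count_uptodate`, `wait_uptodate`, `EVENTS_CMD` and `POLL_INTERVAL` are hypothetical names added here, and the retry limit is arbitrary.

```shell
#!/bin/sh
# Count lines reporting an UpToDate disk in `drbdsetup events2 --now` output
# read from stdin. Like the grep in the script above, this matches both
# disk:UpToDate and peer-disk:UpToDate.
count_uptodate() {
    grep -c 'disk:UpToDate'
}

# Assumption: the command producing the events is kept in a variable so it
# can be swapped out for testing.
EVENTS_CMD="${EVENTS_CMD:-drbdsetup events2 --now}"

# Hypothetical helper: poll until $1 lines are UpToDate or $2 attempts have
# been made. Returns 0 on success, 1 if the attempt limit is reached.
wait_uptodate() {
    want=$1
    tries=$2
    i=0
    while [ "$i" -lt "$tries" ]
    do
        n=$($EVENTS_CMD | count_uptodate)
        echo "num UpToDate=$n"
        [ "$n" = "$want" ] && return 0
        i=$((i + 1))
        sleep "${POLL_INTERVAL:-5}"
    done
    return 1
}
```

For example, `wait_uptodate 4 60 || drbdsetup status r0 --verbose --statistics` would print the resource state once the limit is hit instead of looping forever.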
r0.res:-
resource r0 {
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
        unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
        split-brain "/bin/touch /tmp/drbd_split_brain.flg";
    }
    startup {
        wfc-timeout 0; ## Infinite!
        degr-wfc-timeout 120; ## 2 minutes.
        outdated-wfc-timeout 120;
    }
    disk {
        resync-rate 100M;
        on-io-error detach;
        disable-write-same;
    }
    net {
        protocol C;
    }
    device /dev/drbd0;
    disk /dev/VolGroup00/lv_home;
    meta-disk /dev/VolGroup00/lv_drbd_meta [0];
    on n1 {
        address 192.168.52.151:7789;
        node-id 1;
    }
    on n2 {
        address 192.168.52.152:7789;
        node-id 2;
    }
    on n3 {
        address 192.168.53.151:7789;
        node-id 3;
    }
    on n4 {
        address 192.168.53.152:7789;
        node-id 4;
    }
    connection-mesh {
        hosts n1 n2 n3 n4;
    }
}
Is there any more information I can provide that would help to track down this issue?
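If per-node state at the moment of the hang would be useful, something like the following could gather it. This is a rough sketch only: `collect_drbd_state` and `REMOTE` are hypothetical names, the output file layout is an assumption, and the node and resource names are taken from r0.res above.

```shell
#!/bin/sh
# Hypothetical helper: save DRBD state from every node into one directory.
# REMOTE defaults to ssh; override it (e.g. REMOTE=echo) for a dry run.
REMOTE=${REMOTE:-ssh}

collect_drbd_state() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for node in "$@"
    do
        # Connection, disk and replication state as seen by this node.
        $REMOTE "$node" 'drbdsetup status r0 --verbose --statistics' \
            > "$outdir/$node.status" 2>&1
        # Kernel messages mentioning drbd on this node.
        $REMOTE "$node" 'dmesg | grep -i drbd' > "$outdir/$node.dmesg" 2>&1
    done
}
```

It could be run from n1 as `collect_drbd_state /tmp/drbd_stuck n1 n2 n3 n4` once the wait loop stops making progress.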
Regards,
Jeremy Faith