[DRBD-user] drbd90 unexpected split-brain, possible uuid_compare issue

Mon Oct 19 16:59:26 CEST 2020

Hi,

drbd90 kernel module version: 9.0.22-2 (also 9.0.25-1 compiled from source)
drbd90-utils: 9.12.2-1
kernel: 3.10.0-1127.18.2.el7.x86_64

4 nodes
n1 primary
n2,n3,n4 all secondary

If I run the following script then sometimes, after the start, some of the nodes get stuck in the Outdated or Inconsistent state forever. The loop generally completes correctly several times (the maximum was about 14) before getting stuck.
I have NOT been able to replicate the split-brain state in this way, but I think it is related.

while true
do
  ssh n4 'service corosync stop'
  ssh n3 'service corosync stop'
  ssh n2 'service corosync stop'
  ssh n1 'service corosync stop'
  sleep 5
  ssh n1 'service pacemaker start'
  ssh n2 'service pacemaker start'
  ssh n3 'service pacemaker start'
  ssh n4 'service pacemaker start'
  #At this point r0 resource is mounted on /home
  #and processes are writing to it on n1
  while true
  do
    sleep 1
    echo "events2 `date`"
    drbdsetup events2 --now  -c
    num_u2d=`drbdsetup events2 --now|grep -c disk:UpToDate`
    echo "num UpToDate=$num_u2d"
    [ "$num_u2d" = 4 ]&&break
    sleep 4
  done
done

If I change the stop order to
  n1,n4,n3,n2
and the start order to
  n2,n3,n4,n1
there is never a problem: I left this running over the weekend and it completed more than 5 thousand iterations without getting stuck.

I replicated the same issue using drbdadm commands directly, so this is not a corosync/pacemaker issue,
i.e. running the following script on n1 also produces the problem:-
while true
do
  for n in 4 3 2
  do
    ssh n$n 'drbdadm down r0'
  done
  #stop process that write to /home
  a2ksys stop >> a2ksys_stop.log 2>&1
  while [ -d /home/cem ]
  do
    umount /home
    [ ! -d /home/cem ]&&break
    echo lsof output
    lsof /home
    sleep 5
  done
  drbdadm secondary r0
  drbdadm down r0
  #END OF STOP

  drbdadm up r0
  while true
  do
    drbdadm primary r0&&break
    echo "sleep 5 before retry primary"
    sleep 5
  done
  while true
  do
    mount -orw /dev/drbd0 /home&&break
    echo "sleep 5 before retry mount"
    sleep 5
  done
  #start processes that do some writes HERE
  a2ksys start >> a2ksys_start.log 2>&1
  ssh n2 'drbdadm up r0'
  ssh n3 'drbdadm up r0'
  ssh n4 'drbdadm up r0'
  while true
  do
    sleep 1
    echo "events2 `date`"
    drbdsetup events2 --now  -c
    num_u2d=`drbdsetup events2 --now|grep -c disk:UpToDate`
    echo "num UpToDate=$num_u2d"
    [ "$num_u2d" = 4 ]&&break
    sleep 4
  done
done
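
The inner wait loop in both scripts spins forever once a node is stuck, so the hang has to be spotted by hand. A bounded variant could look roughly like the sketch below. It is not part of the scripts that produced the results above; the function name wait_uptodate and the EVENTS_CMD override are hypothetical names introduced here for illustration, and the timeout values are arbitrary.

```shell
# Hypothetical helper: poll until $1 "disk:UpToDate" entries are seen, or
# give up once $2 seconds have elapsed. $3 is the poll interval (default 5).
# EVENTS_CMD is an assumed override hook so the loop can be exercised
# without DRBD; by default it runs the same drbdsetup command as above.
EVENTS_CMD=${EVENTS_CMD:-"drbdsetup events2 --now"}

wait_uptodate() {
  want=$1
  timeout=$2
  interval=${3:-5}
  elapsed=0
  while :
  do
    # grep -c counts matching lines; the pattern also matches
    # peer-disk:UpToDate, which is how the original loops reach 4 on n1
    # (one local disk plus three peer disks).
    num_u2d=$($EVENTS_CMD | grep -c disk:UpToDate)
    [ "$num_u2d" -eq "$want" ] && return 0
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

In the outer loop this could be used as `wait_uptodate 4 600 || { drbdadm status r0; break; }`, so that the test stops with the cluster still in the stuck state instead of polling indefinitely.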

r0.res:-
resource r0 {
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
    unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
    split-brain "/bin/touch /tmp/drbd_split_brain.flg";
  }
  startup {
    wfc-timeout            0;  ## Infinite!
    degr-wfc-timeout     120;  ## 2 minutes.
    outdated-wfc-timeout 120;
  }
  disk {
    resync-rate 100M;
    on-io-error detach;
    disable-write-same;
  }
  net {
    protocol C;
  }

  device     /dev/drbd0;
  disk       /dev/VolGroup00/lv_home;
  meta-disk  /dev/VolGroup00/lv_drbd_meta [0];

  on n1 {
    address 192.168.52.151:7789;
    node-id 1;
  }
  on n2 {
    address 192.168.52.152:7789;
    node-id 2;
  }
  on n3 {
    address 192.168.53.151:7789;
    node-id 3;
  }
  on n4 {
    address 192.168.53.152:7789;
    node-id 4;
  }

  connection-mesh {
    hosts n1 n2 n3 n4;
  }
}

Is there any more information I can provide that would help to track down this issue?

Regards,
Jeremy Faith