[DRBD-user] drbd90 unexpected split-brain, possible uuid_compare issue

Igor Cicimov igorc at encompasscorporation.com
Tue Oct 20 05:50:13 CEST 2020


On Tue, Oct 20, 2020 at 4:17 AM Jeremy Faith <jeremy.faith at jci.com> wrote:

> Hi,
>
> drbd90 kernel module version: 9.0.22-2 (also 9.0.25-1 compiled from source)
> drbd90-utils: 9.12.2-1
> kernel: 3.10.0-1127.18.2.el7.x86_64
>
> 4 nodes
> n1 primary
> n2,n3,n4 all secondary
>
> If I run the following script then sometimes, after the start, some of
> the nodes get stuck in Outdated or Inconsistent state forever. The loop
> generally works correctly several times (max was about 14) before getting
> stuck.
> I have NOT been able to replicate the split-brain state in this way but I
> think it is related.
>
> while true
> do
>   ssh n4 'service corosync stop'
>   ssh n3 'service corosync stop'
>   ssh n2 'service corosync stop'
>   ssh n1 'service corosync stop'
>   sleep 5
>   ssh n1 'service pacemaker start'
>   ssh n2 'service pacemaker start'
>   ssh n3 'service pacemaker start'
>   ssh n4 'service pacemaker start'
>   #At this point r0 resource is mounted on /home
>   #and processes are writing to it on n1
>   while true
>   do
>     sleep 1
>     echo "events2 `date`"
>     drbdsetup events2 --now  -c
>     num_u2d=`drbdsetup events2 --now|grep -c disk:UpToDate`
>     echo "num UpToDate=$num_u2d"
>     [ "$num_u2d" = 4 ]&&break
>     sleep 4
>   done
> done
>
> If I change the stop order to
>   n1,n4,n3,n2
> and the start order to
>   n2,n3,n4,n1
> there is never a problem. I left this running over the weekend and it
> worked over 5 thousand times.
>
> I replicated the same issue using drbdadm commands directly, so this is not
> a corosync/pacemaker issue,
> i.e. running the following script on n1 also produces the problem:
> while true
> do
>   for n in 4 3 2
>   do
>     ssh n$n 'drbdadm down r0'
>   done
>   #stop process that write to /home
>   a2ksys stop >> a2ksys_stop.log 2>&1
>   while [ -d /home/cem ]
>   do
>     umount /home
>     [ ! -d /home/cem ]&&break
>     echo lsof output
>     lsof /home
>     sleep 5
>   done
>   drbdadm secondary r0
>   drbdadm down r0
>   #END OF STOP
>
>   drbdadm up r0
>   while true
>   do
>     drbdadm primary r0&&break
>     echo "sleep 5 before retry primary"
>     sleep 5
>   done
>   while true
>   do
>     mount -orw /dev/drbd0 /home&&break
>     echo "sleep 5 before retry mount"
>     sleep 5
>   done
>   #start processes that do some writes HERE
>   a2ksys start >> a2ksys_start.log 2>&1
>   ssh n2 'drbdadm up r0'
>   ssh n3 'drbdadm up r0'
>   ssh n4 'drbdadm up r0'
>   while true
>   do
>     sleep 1
>     echo "events2 `date`"
>     drbdsetup events2 --now  -c
>     num_u2d=`drbdsetup events2 --now|grep -c disk:UpToDate`
>     echo "num UpToDate=$num_u2d"
>     [ "$num_u2d" = 4 ]&&break
>     sleep 4
>   done
> done
>
> r0.res:-
> resource r0 {
>   handlers {
>     fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
>     unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
>     split-brain "/bin/touch /tmp/drbd_split_brain.flg";
>   }
>   startup {
>     wfc-timeout            0;  ## Infinite!
>     degr-wfc-timeout     120;  ## 2 minutes.
>     outdated-wfc-timeout 120;
>   }
>   disk {
>     resync-rate 100M;
>     on-io-error detach;
>     disable-write-same;
>   }
>   net {
>     protocol C;
>   }
>
>   device     /dev/drbd0;
>   disk       /dev/VolGroup00/lv_home;
>   meta-disk  /dev/VolGroup00/lv_drbd_meta [0];
>
>   on n1 {
>     address 192.168.52.151:7789;
>     node-id 1;
>   }
>   on n2 {
>     address 192.168.52.152:7789;
>     node-id 2;
>   }
>   on n3 {
>     address 192.168.53.151:7789;
>     node-id 3;
>   }
>   on n4 {
>     address 192.168.53.152:7789;
>     node-id 4;
>   }
>
>   connection-mesh {
>     hosts n1 n2 n3 n4;
>   }
> }
>
> Is there any more information I can provide that would help to track down
> this issue?
>
> Regards,
> Jeremy Faith


You should definitely stop pacemaker before corosync, and start them in the
opposite order (corosync first, then pacemaker).
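
As a rough sketch of that ordering, applied to the stop/start part of your
first loop: the node names and the "service" commands are taken from your
script, while the helper names (run, stop_cluster, start_cluster) and the
DRY_RUN switch are made up for illustration. Doing both services node by
node, rather than pacemaker everywhere first, is an assumption on my part.

```shell
#!/bin/sh
# Sketch only: run, stop_cluster, start_cluster and DRY_RUN are
# hypothetical names; node names and orders come from the quoted script.

STOP_ORDER="n4 n3 n2 n1"
START_ORDER="n1 n2 n3 n4"

# Run a command, or only print it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Stop pacemaker first, then corosync, on each node.
stop_cluster() {
  for n in $STOP_ORDER; do
    run ssh "$n" 'service pacemaker stop'
    run ssh "$n" 'service corosync stop'
  done
}

# Start in the opposite order: corosync first, then pacemaker.
start_cluster() {
  for n in $START_ORDER; do
    run ssh "$n" 'service corosync start'
    run ssh "$n" 'service pacemaker start'
  done
}
```

Calling stop_cluster, then "sleep 5", then start_cluster would replace the
eight ssh lines at the top of the loop; setting DRY_RUN=1 first only prints
the commands so the order can be checked before touching the cluster.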
