Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
You need fencing to avoid split-brain. I'm not too familiar with
Proxmox; does it use cman + rgmanager or pacemaker behind the scenes?
In either case, configure fencing in the cluster stack (called 'stonith'
in pacemaker), then configure DRBD to block and call a fence when the
peer is lost. This is done by setting 'fencing resource-and-stonith;'
in the disk section and then setting 'fence-peer
"/path/to/{rhcs_fence,crm-fence-peer.sh}";' in the handlers section.
Which script you use depends on which cluster stack you are running.
This way, when DRBD would otherwise split-brain, it instead blocks I/O
until the peer is fenced, ensuring that when writes resume, the peer is
guaranteed not to be writing at the same time.
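
In drbd.conf terms, that looks roughly like this (the resource name is
just an example and handler paths vary by distro; crm-fence-peer.sh and
crm-unfence-peer.sh ship with the pacemaker integration, rhcs_fence is
the cman + rgmanager counterpart):

    resource r0 {
        disk {
            # Suspend I/O on loss of the peer until the fence
            # handler reports back.
            fencing resource-and-stonith;
        }
        handlers {
            # For pacemaker; use /path/to/rhcs_fence under
            # cman + rgmanager instead.
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # Clears the fencing constraint once the peer has resynced.
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }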
digimer
On 28/02/14 07:24 AM, Gerald Brandt wrote:
> Hi,
>
> I'm doing tests on a new DRBD setup, so I'm hammering the DRBD system
> with reads and writes (three VMs writing with dd and three VMs reading
> with dd). The test maxes out my 2x1GigE bonded links (both data and sync)
> and maxes out my hard drives (five 7200 RPM SATA drives, RAID6). I share
> the DRBD disks to Proxmox (KVM-based) via NFS v3.
>
> 1. I tested the system all night, and both DRBD servers handled
> everything fine.
> 2. I rebooted the primary.
> 3. Failover of the IP and NFS worked, and the secondary became primary.
> 4. The rebooted server came back up and entered split-brain.
>
> I use uCarp for the failover instead of heartbeat/pacemaker.
>
> I've used iSCSI over DRBD/Heartbeat before, but not NFS. Any ideas why
> I hit split-brain?
>
> Gerald
>
>
> drbd.conf
> # cat /etc/drbd.conf
> # You can find an example in /usr/share/doc/drbd.../drbd.conf.example
>
> include "drbd.d/global_common.conf";
> # include "drbd.d/*.res";
>
> resource target.0 {
>     protocol C;
>
>     handlers {
>         pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
>         pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
>         local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
>         outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>         before-resync-target /usr/local/bin/resync-start-RAID6.sh;
>         after-resync-target /usr/local/bin/resync-end-RAID6.sh;
>     }
>
>     startup {
>         degr-wfc-timeout 120;
>     }
>
>     disk {
>         on-io-error detach;
>     }
>
>     net {
>         cram-hmac-alg sha1;
>         shared-secret "password";
>         after-sb-0pri disconnect;
>         after-sb-1pri disconnect;
>         after-sb-2pri disconnect;
>         rr-conflict disconnect;
>         sndbuf-size 0;
>     }
>
>     syncer {
>         c-plan-ahead 0;
>         rate 30M;
>         verify-alg sha1;
>         # al-extents 257;
>         al-extents 3389;
>     }
>
>     on iscsi-filer-1 {
>         device /dev/drbd0;
>         disk /dev/md0;
>         address 192.168.10.1:7789;
>         flexible-meta-disk /dev/md3;
>     }
>
>     on iscsi-filer-2 {
>         device /dev/drbd0;
>         disk /dev/md0;
>         address 192.168.10.2:7789;
>         flexible-meta-disk /dev/md3;
>     }
> }
>
> resource target.2 {
>     protocol C;
>
>     handlers {
>         pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
>         pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
>         local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
>         outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>         before-resync-target /usr/local/bin/resync-start-RAID5.sh;
>         after-resync-target /usr/local/bin/resync-end-RAID5.sh;
>     }
>
>     startup {
>         degr-wfc-timeout 120;
>     }
>
>     disk {
>         on-io-error detach;
>     }
>
>     net {
>         cram-hmac-alg sha1;
>         shared-secret "password";
>         after-sb-0pri disconnect;
>         after-sb-1pri disconnect;
>         after-sb-2pri disconnect;
>         rr-conflict disconnect;
>         sndbuf-size 0;
>     }
>
>     syncer {
>         c-plan-ahead 0;
>         rate 30M;
>         verify-alg sha1;
>         # al-extents 257;
>         al-extents 3389;
>     }
>
>     on iscsi-filer-1 {
>         device /dev/drbd2;
>         disk /dev/md2;
>         address 192.168.10.1:7790;
>         flexible-meta-disk /dev/md4;
>     }
>
>     on iscsi-filer-2 {
>         device /dev/drbd2;
>         disk /dev/md2;
>         address 192.168.10.2:7790;
>         flexible-meta-disk /dev/md4;
>     }
> }
>
>
> ucarp-up
> #!/bin/sh
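> # Runs when uCarp takes the VIP: promote DRBD, bring up the alias
> # interface, mount the exports, and start the NFS server.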
> /sbin/drbdadm primary all
> /sbin/ifup $1:ucarp
> /sbin/drbdadm primary all
> /sbin/drbdadm primary all
> /sbin/drbdadm primary all
> mount -o defaults,noatime,nodiratime /dev/drbd0 /nfs-exported/raid6
> mount -o defaults,noatime,nodiratime /dev/drbd2 /nfs-exported/raid5
> /etc/init.d/nfs-kernel-server restart
> sleep 2
> echo 256 > /proc/fs/nfsd/threads
>
>
> ucarp-down
> #!/bin/sh
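> # Runs when uCarp releases the VIP: stop the NFS server, unmount the
> # exports, demote DRBD, and take down the alias interface.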
> /etc/init.d/nfs-kernel-server stop
> umount /nfs-exported/raid6
> umount /nfs-exported/raid5
> /sbin/drbdadm secondary all
> /sbin/ifdown $1:ucarp
>
>
>
> --
> Gerald Brandt
> Majentis Technologies
> gbr at majentis.com
> 204-229-6595
> www.majentis.com
>
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?