[DRBD-user] crm-fence-peer.sh & maintenance / reboots

Jake Smith jsmith at argotec.com
Fri Aug 3 16:09:27 CEST 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


----- Original Message -----
> From: "Dirk Bonenkamp - ProActive" <dirk at proactive.nl>
> To: drbd-user at lists.linbit.com
> Sent: Friday, August 3, 2012 4:17:46 AM
> Subject: Re: [DRBD-user] crm-fence-peer.sh & maintenance / reboots
> 
> Hi all,
> 
> I'm still struggling with this problem. Since my last mail, I've
> simplified my setup: 1 DRBD resource with only 1 file system
> resource. I normally have stonith in place & working, but this is
> also removed for simplicity.
> 
> Things that work as expected:
> - Pulling the dedicated DRBD network cable. The location constraint
> is created as expected (preventing promotion of the now disconnected
> slave node). The constraint gets removed after re-plugging the cable.
> - Rebooting the slave node / putting the slave node in standby mode.
> No constraints (as expected), no problems.
> - Migrating the file system resource. The file system unmounts, the
> slave node becomes master, the file system mounts, no problems.
> 
> Things that do not work as expected:
> - Rebooting the master node / putting the master node in standby
> mode. The location constraint is created, which prevents the slave
> from becoming master... To correct this, I have to bring the old
> master node online again and remove the constraint by hand.
> 
> My setup:
> Ubuntu 10.04 running 2.6.32-41-generic / x86_64
> DRBD 8.3.13 (self compiled)

Hi Dirk!

This might be the bug affecting fencing that I ran into when using the -41 kernel in Ubuntu with DRBD 8.3.13:

https://bugs.launchpad.net/ubuntu/+source/drbd8/+bug/1000355
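
For reference, the resource-level fencing setup that drives crm-fence-peer.sh
usually looks roughly like this in drbd.conf (the resource name r0 and the
handler paths are only examples, adjust them to your own config):

  resource r0 {
    disk {
      fencing resource-only;
    }
    handlers {
      # sets the location constraint when the peer becomes unreachable
      fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
      # removes the constraint again once the peer has resynced
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
  }

Worth comparing that against your own drbd.conf; the after-resync-target
handler is the piece that is supposed to clean the constraint up again.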

> Pacemaker 1.1.6 (from HA maintainers PPA)
> Corosync 1.4.2 (from HA maintainers PPA)
> 
> Network:
> 10.0.0.0/24 on eth0: network for 'normal' connectivity
> 172.16.0.1 <-> 172.16.0.2 on eth1: dedicated network for DRBD
> 
> corosync-cfgtool -s output:
> 
> Printing ring status.
> Local node ID 16781484
> RING ID 0
>         id      = 172.16.0.1
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 10.0.0.71
>         status  = ring 1 active with no faults

Have a look here for a second step that is required to verify the corosync rings are actually OK when it's only a two-node cluster:
http://www.hastexo.com/resources/hints-and-kinks/checking-corosync-cluster-membership
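
Roughly, the extra check described there boils down to asking corosync's
object database for the actual member list, something like this on
corosync 1.4.x (exact key names may differ between versions):

  corosync-objctl runtime.totem.pbl.mrp.srp.members

and confirming on both nodes that the peer shows up with status=joined.
corosync-cfgtool -s only tells you the local ring interfaces are healthy,
not that the other node is really a member.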

> 
> Configuration files:
> http://pastebin.com/VUgHcuQ0
> 
> Log of a failed failover (master node):
> http://pastebin.com/f5amFMzY
> 
> Log of a failed failover (slave node):
> http://pastebin.com/QHBPnHFQ

How about the output of /proc/drbd and 'crm configure show' while the master node is in standby?
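
Something along these lines, run on both nodes while the old master is in
standby, would help (the constraint id below is only an example;
crm-fence-peer.sh usually names it drbd-fence-by-handler-<master resource id>):

  cat /proc/drbd
  crm configure show | grep -A 2 drbd-fence
  # once the stale constraint is confirmed, it can be dropped by hand, e.g.:
  crm configure delete drbd-fence-by-handler-ms_drbd_r0

That should at least show whether the fence constraint is the only thing
holding the promotion back.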

HTH
Jake


