[DRBD-user] fence (unfence really) versus corosync (slow startup)

Wed May 10 04:43:09 CEST 2017

Hello,

On Wed, 29 Mar 2017 12:01:43 +0200 Lars Ellenberg wrote:

> On Mon, Mar 27, 2017 at 11:47:52AM +0900, Christian Balzer wrote:
> > 
> > Hello,
> > 
> > Since Debian Jessie has had pacemaker in backports (again) for some time
> > now, this is actually my first corosync+pacemaker+DRBD based cluster; I
> > have plenty of them based on Wheezy and thus heartbeat (and SysV init).
> > 
> > I'm not sure whether to blame systemd (see the startup time) or the rather
> > sedate nature of corosync when it comes to adding member nodes, but the CIB
> > fencing, unfence in particular, isn't working as expected when one simply
> > reboots a node in an idle cluster.
> > 
> > Since there is no dirty data, the resync finishes instantly and the
> > unfence script is called, long before the just rebooted node has become a
> > corosync and consequently a pacemaker cluster member again.
> > 
> > To wit:
> > ---
> > 
> > Mar 27 10:57:33 mbx12 kernel: [   21.177649] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> > Mar 27 10:57:33 mbx12 kernel: [   21.177653] block drbd1: updated UUIDs B48E58521182F05E:0000000000000000:2025C1AFB245A42C:2024C1AFB245A42C
> > Mar 27 10:57:33 mbx12 kernel: [   21.177658] block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
> > Mar 27 10:57:33 mbx12 kernel: [   21.177796] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
> > Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: invoked for mb12
> > Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Signon to CIB failed: Transport endpoint is not connected
> > Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Init failed, could not perform requested operations
> > Mar 27 10:57:33 mbx12 kernel: [   21.211736] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 1 (0x100)  
> 
> "crm-unfence-peer.sh" tries to remove a pacemaker constraint.
> So don't start DRBD before you start Pacemaker.
>
Quite aware of that; the SysV dependencies handled that beautifully.
 
> Have Pacemaker start DRBD, then the CIB will be available for unfence.
>
Easier said than done, as Debian only supplies the LSB init script, which
a "systemctl disable" will NOT disable.
For now a brutal "exit 0" solves that; a bug report to Debian is in progress...
 
Christian
> 
> > Mar 27 10:57:36 mbx12 systemd[1]: Starting Corosync Cluster Engine...
> > 
> > Mar 27 10:57:38 mbx12 crmd[2300]:   notice: The local CRM is operational
> > 
> > ---
> > 
> > Restarting pacemaker on the rebooted node _after_ corosync and pacemaker
> > membership has been established "fixes" this of course.
> > In the case of a real failure on a busy cluster the resync would have
> > likely taken significantly more than 5 seconds and things would have
> > worked as well.
> > 
> > I'm wondering if others have seen this, if there are tuning or dependency
> > settings for corosync I'm missing or if it's just a question of inserting
> > a "long enough" sleep into the unfence script (ouch).   
> 
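
For what it's worth, a bounded retry would be less ugly than a fixed sleep.
The following is only a minimal sketch, not something crm-unfence-peer.sh
does today: wait_for is a hypothetical helper, and using "cibadmin -Q" as
the probe for a reachable CIB is my assumption.

```shell
#!/bin/sh
# Hypothetical helper: run a probe command repeatedly until it succeeds
# or until roughly <timeout> seconds have been spent sleeping.
# Usage: wait_for <timeout_seconds> <interval_seconds> <command> [args...]
wait_for() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    # Probe output is discarded; only the exit status matters.
    while ! "$@" >/dev/null 2>&1; do
        # Give up once the time budget is used up.
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Assumed use near the top of an unfence handler (not run here):
#   wait_for 60 2 cibadmin -Q || exit 1
```

The handler would then proceed as soon as the CIB answers, and still fail
with a non-zero exit code if the cluster stack never comes up.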


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/