[DRBD-user] fence (unfence really) versus corosync (slow startup)

Wed Mar 29 12:01:43 CEST 2017

On Mon, Mar 27, 2017 at 11:47:52AM +0900, Christian Balzer wrote:
> 
> Hello,
> 
> Since Debian Jessie has had pacemaker in backports (again) for some time now,
> this is actually my first corosync+pacemaker+DRBD based cluster; I have plenty
> of them based on Wheezy and thus heartbeat (and SysV init).
> 
> I'm not sure whether to blame systemd (see the startup time) or the rather
> sedate nature of corosync when it comes to adding member nodes, but the CIB
> fencing, unfence in particular, isn't working as expected when one simply
> reboots a node in an idle cluster.
> 
> Since there is no dirty data, the resync finishes instantly and the
> unfence script is called, long before the just rebooted node has become a
> corosync and consequently a pacemaker cluster member again.
> 
> To wit:
> ---
> 
> Mar 27 10:57:33 mbx12 kernel: [   21.177649] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> Mar 27 10:57:33 mbx12 kernel: [   21.177653] block drbd1: updated UUIDs B48E58521182F05E:0000000000000000:2025C1AFB245A42C:2024C1AFB245A42C
> Mar 27 10:57:33 mbx12 kernel: [   21.177658] block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
> Mar 27 10:57:33 mbx12 kernel: [   21.177796] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
> Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: invoked for mb12
> Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Signon to CIB failed: Transport endpoint is not connected
> Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Init failed, could not perform requested operations
> Mar 27 10:57:33 mbx12 kernel: [   21.211736] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 1 (0x100)

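"crm-unfence-peer.sh" tries to remove a Pacemaker constraint from the CIB.
It can only do that if Pacemaker is already running on the node; the
"Signon to CIB failed" message above means it was not.
So don't start DRBD before you start Pacemaker.

Have Pacemaker start DRBD; then the CIB will be available for the unfence.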
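
As a rough sketch of what "have Pacemaker start DRBD" can look like in
crm shell syntax: the resource name r0 and the IDs p_drbd_r0 and
ms_drbd_r0 are placeholders, not taken from this cluster, and the unit
name in the comment is an assumption about the Debian packaging. The
idea is to keep the DRBD init script out of the boot sequence and let
the ocf:linbit:drbd agent bring the resource up once the cluster stack
is running.

```
# On every node, keep the init system from bringing DRBD up at boot,
# for example: systemctl disable drbd   (unit name is an assumption)

# Let Pacemaker manage the DRBD resource (r0 is a placeholder name)
primitive p_drbd_r0 ocf:linbit:drbd \
    params drbd_resource=r0 \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
ms ms_drbd_r0 p_drbd_r0 \
    meta master-max=1 master-node-max=1 \
         clone-max=2 clone-node-max=1 notify=true
```

With a setup along these lines, DRBD only connects and resyncs after
Pacemaker is up on the node, so the after-resync-target handler should
find a CIB to talk to.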

> Mar 27 10:57:36 mbx12 systemd[1]: Starting Corosync Cluster Engine...
> 
> Mar 27 10:57:38 mbx12 crmd[2300]:   notice: The local CRM is operational
> 
> ---
> 
> Restarting pacemaker on the rebooted node _after_ corosync and pacemaker
> membership has been established "fixes" this of course.
> In the case of a real failure on a busy cluster the resync would have
> likely taken significantly more than 5 seconds and things would have
> worked as well.
> 
> I'm wondering if others have seen this, if there are tuning or dependency
> settings for corosync I'm missing or if it's just a question of inserting
> a "long enough" sleep into the unfence script (ouch). 

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed