[DRBD-user] fence (unfence really) versus corosync (slow startup)
chibi at gol.com
Wed May 10 04:43:09 CEST 2017
On Wed, 29 Mar 2017 12:01:43 +0200 Lars Ellenberg wrote:
> On Mon, Mar 27, 2017 at 11:47:52AM +0900, Christian Balzer wrote:
> > Hello,
> > Since Debian Jessie has had pacemaker in backports (again) for some time now,
> > this is actually my first corosync+pacemaker+DRBD based cluster; I have plenty
> > of them based on Wheezy and thus heartbeat (and SysV init).
> > I'm not sure whether to blame systemd (see the startup time) or the rather
> > sedate nature of corosync when it comes to adding member nodes, but the CIB
> > fencing, unfence in particular, isn't working as expected when one simply
> > reboots a node in an idle cluster.
> > Since there is no dirty data, the resync finishes instantly and the
> > unfence script is called, long before the just rebooted node has become a
> > corosync and consequently a pacemaker cluster member again.
> > To wit:
> > ---
> > Mar 27 10:57:33 mbx12 kernel: [ 21.177649] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> > Mar 27 10:57:33 mbx12 kernel: [ 21.177653] block drbd1: updated UUIDs B48E58521182F05E:0000000000000000:2025C1AFB245A42C:2024C1AFB245A42C
> > Mar 27 10:57:33 mbx12 kernel: [ 21.177658] block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
> > Mar 27 10:57:33 mbx12 kernel: [ 21.177796] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
> > Mar 27 10:57:33 mbx12 crm-unfence-peer.sh: invoked for mb12
> > Mar 27 10:57:33 mbx12 crm-unfence-peer.sh: Signon to CIB failed: Transport endpoint is not connected
> > Mar 27 10:57:33 mbx12 crm-unfence-peer.sh: Init failed, could not perform requested operations
> > Mar 27 10:57:33 mbx12 kernel: [ 21.211736] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 1 (0x100)
> "crm-unfence-peer.sh" tries to remove a pacemaker constraint.
> So don't start DRBD before you start Pacemaker.
Quite aware of that, sysv dependencies handled that beautifully.
> Have Pacemaker start DRBD, then the CIB will be available for unfence.
Easier said than done, as Debian only supplies the LSB init script, which
a "systemctl disable" will NOT disable.
For now a brutal "exit 0" in that init script solves that; a bug report to
Debian is in progress...
> > Mar 27 10:57:36 mbx12 systemd: Starting Corosync Cluster Engine...
> > Mar 27 10:57:38 mbx12 crmd: notice: The local CRM is operational
> > ---
> > Restarting pacemaker on the rebooted node _after_ corosync and pacemaker
> > membership has been established "fixes" this of course.
> > In the case of a real failure on a busy cluster the resync would have
> > likely taken significantly more than 5 seconds and things would have
> > worked as well.
> > I'm wondering if others have seen this, if there are tuning or dependency
> > settings for corosync I'm missing or if it's just a question of inserting
> > a "long enough" sleep into the unfence script (ouch).
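On the sleep idea: a bounded retry would probably be less ugly than a fixed
sleep. Below is a rough sketch of a wrapper that polls until the CIB answers
and only then hands off to the stock unfence script. It rests on assumptions
that are not from the logs above: that "cibadmin --query" exits non-zero
while the CIB is unreachable, that the handler lives at
/usr/lib/drbd/crm-unfence-peer.sh, and that 120 s / 2 s are sensible limits.
The names wait_until, cib_reachable and main are made up for illustration.

```python
#!/usr/bin/env python3
"""Wait (bounded) for the CIB, then run the real unfence handler."""
import subprocess
import sys
import time

# Assumption: this probe fails for as long as the CIB cannot be reached.
CIB_PROBE = ["cibadmin", "--query"]
# Assumption: usual location of the stock handler; adjust for your install.
UNFENCE = ["/usr/lib/drbd/crm-unfence-peer.sh"]


def cib_reachable(probe=CIB_PROBE):
    """Return True if the probe command could be run and exited with 0."""
    try:
        result = subprocess.run(
            probe, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # Probe binary missing or not executable: treat as "not reachable".
        return False
    return result.returncode == 0


def wait_until(check, timeout=120.0, interval=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout seconds have passed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def main(argv):
    """Handler entry point: exit 1 on timeout, else the unfence exit code."""
    if not wait_until(cib_reachable):
        print("CIB still unreachable, giving up", file=sys.stderr)
        return 1
    # DRBD passes its context via environment variables, which the child
    # process inherits; extra arguments are forwarded unchanged.
    return subprocess.call(UNFENCE + argv[1:])

# To use this as the handler script, end the file with:
#     sys.exit(main(sys.argv))
```

The after-resync-target handler in the DRBD configuration would then point at
this wrapper instead of at crm-unfence-peer.sh directly. This only papers
over the ordering problem; having Pacemaker start DRBD remains the proper fix.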
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Rakuten Communications