[DRBD-user] fence (unfence really) versus corosync (slow startup)

Christian Balzer chibi at gol.com
Mon Mar 27 04:47:52 CEST 2017

Hello,

Since Debian Jessie has had pacemaker (again) in backports for some time
now, this is actually my first corosync+pacemaker+DRBD based cluster;
I have plenty of them based on Wheezy and thus heartbeat (and SysV init).

I'm not sure whether to blame systemd (see the startup timestamps below)
or the rather sedate nature of corosync when it comes to adding member
nodes, but the CIB fencing, unfencing in particular, isn't working as
expected when one simply reboots a node in an idle cluster.

Since there is no dirty data, the resync finishes almost instantly and
the unfence script is called long before the just-rebooted node has
become a corosync (and consequently a pacemaker) cluster member again.
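
For context, this is the usual crm-fence-peer pairing in the resource
configuration, roughly like the sketch below (the resource name is a
placeholder, handler paths are the stock Debian drbd-utils ones):
---
resource r0 {
    disk {
        # constrain via the CIB instead of (also) shooting the node
        fencing resource-only;
    }
    handlers {
        # on loss of the peer: adds a constraint banning the Master
        # role from the outdated node
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # on the SyncTarget after a successful resync: removes that
        # constraint again -- the call that fails below
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}
---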

To wit:
---

Mar 27 10:57:33 mbx12 kernel: [   21.177649] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
Mar 27 10:57:33 mbx12 kernel: [   21.177653] block drbd1: updated UUIDs B48E58521182F05E:0000000000000000:2025C1AFB245A42C:2024C1AFB245A42C
Mar 27 10:57:33 mbx12 kernel: [   21.177658] block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
Mar 27 10:57:33 mbx12 kernel: [   21.177796] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: invoked for mb12
Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Signon to CIB failed: Transport endpoint is not connected
Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Init failed, could not perform requested operations
Mar 27 10:57:33 mbx12 kernel: [   21.211736] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 1 (0x100)

Mar 27 10:57:36 mbx12 systemd[1]: Starting Corosync Cluster Engine...

Mar 27 10:57:38 mbx12 crmd[2300]:   notice: The local CRM is operational

---

Restarting pacemaker on the rebooted node _after_ corosync and pacemaker
membership has been established "fixes" this, of course.
In the case of a real failure on a busy cluster, the resync would likely
have taken significantly more than 5 seconds and things would have worked
as well.
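
The obvious "dependency" angle would be to delay DRBD's initial connect
until the cluster stack is up. Assuming DRBD is brought up by its own
(sysvinit-generated) drbd.service at boot rather than by pacemaker, a
drop-in along these lines might do; entirely untested, and the unit
names are the stock Jessie ones:
---
# /etc/systemd/system/drbd.service.d/wait-for-cluster.conf
# untested sketch: only sensible if drbd.service (not pacemaker)
# brings the resource up at boot
[Unit]
After=corosync.service pacemaker.service
Wants=corosync.service pacemaker.service
---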

I'm wondering if others have seen this, whether there are other tuning or
dependency settings for corosync that I'm missing, or if it's just a
question of inserting a "long enough" sleep into the unfence script
(ouch).

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/


