Hello,

Debian Jessie has had pacemaker in backports (again) for some time now, so this is actually my first corosync+pacemaker+DRBD based cluster; I have plenty of them based on Wheezy and thus heartbeat (and SysV init).

I'm not sure whether to blame systemd (see the startup time) or the rather sedate nature of corosync when it comes to adding member nodes, but the CIB fencing, unfence in particular, isn't working as expected when one simply reboots a node in an idle cluster.

Since there is no dirty data, the resync finishes instantly and the unfence script is called long before the just rebooted node has become a corosync, and consequently a pacemaker, cluster member again.

To wit:
---
Mar 27 10:57:33 mbx12 kernel: [ 21.177649] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
Mar 27 10:57:33 mbx12 kernel: [ 21.177653] block drbd1: updated UUIDs B48E58521182F05E:0000000000000000:2025C1AFB245A42C:2024C1AFB245A42C
Mar 27 10:57:33 mbx12 kernel: [ 21.177658] block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar 27 10:57:33 mbx12 kernel: [ 21.177796] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: invoked for mb12
Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Signon to CIB failed: Transport endpoint is not connected
Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Init failed, could not perform requested operations
Mar 27 10:57:33 mbx12 kernel: [ 21.211736] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 1 (0x100)
Mar 27 10:57:36 mbx12 systemd[1]: Starting Corosync Cluster Engine...
Mar 27 10:57:38 mbx12 crmd[2300]: notice: The local CRM is operational
---

Restarting pacemaker on the rebooted node _after_ corosync and pacemaker membership has been established "fixes" this of course.

In the case of a real failure on a busy cluster the resync would likely have taken significantly more than 5 seconds and things would have worked as well.
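For illustration, one way around the race would be a wrapper that polls until the CIB answers and only then calls the stock unfence script, with a bounded wait instead of a fixed delay. This is only a rough sketch under assumptions, not a tested fix: the function names are made up, the path /usr/lib/drbd/crm-unfence-peer.sh is assumed to be where the package installs the script, and it assumes that `cibadmin --query` fails for as long as the CIB signon fails the way it does in the log above.

```shell
#!/bin/sh
# Hypothetical wrapper functions for the after-resync-target handler.
# Assumption: the stock script is installed at this path.
UNFENCE="${UNFENCE:-/usr/lib/drbd/crm-unfence-peer.sh}"
WAIT_LIMIT="${WAIT_LIMIT:-120}"   # give up after this many seconds
WAIT_STEP="${WAIT_STEP:-2}"       # pause between probes, in seconds

# Retry a command until it succeeds or the limit (in seconds) is reached.
# Usage: wait_until LIMIT STEP COMMAND [ARGS...]
wait_until() {
    _limit=$1
    _step=$2
    shift 2
    _waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$_waited" -ge "$_limit" ]; then
            return 1
        fi
        sleep "$_step"
        _waited=$((_waited + _step))
    done
    return 0
}

# Assumption: a successful CIB query means the local pacemaker is up
# far enough for the unfence script to sign on.
cib_ready() {
    cibadmin --query
}

# Wait for the CIB, then hand over to the real unfence script.
unfence_when_ready() {
    if wait_until "$WAIT_LIMIT" "$WAIT_STEP" cib_ready; then
        "$UNFENCE" "$@"
        return $?
    fi
    echo "CIB not reachable after ${WAIT_LIMIT}s, not unfencing" >&2
    return 1
}
```

A wrapper script would end with `unfence_when_ready "$@"` and be named as the after-resync-target handler in place of the stock script. Note that this still blocks the DRBD helper for up to WAIT_LIMIT seconds, so it is only a somewhat more polite version of the sleep.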
I'm wondering if others have seen this, if there are tuning or dependency settings for corosync I'm missing, or if it's just a question of inserting a "long enough" sleep into the unfence script (ouch).

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Rakuten Communications
http://www.gol.com/