On Mon, Mar 27, 2017 at 11:47:52AM +0900, Christian Balzer wrote:
>
> Hello,
>
> Since Debian Jessie for some time now has pacemaker (again) in backports,
> this is actually my first corosync+pacemaker+DRBD based cluster, plenty
> of them based on Wheezy and thus heartbeat (and SysV Init).
>
> I'm not sure if to blame systemd (see the startup time) or the rather
> sedate nature of corosync when it come to adding member nodes, but the CIB
> fencing, unfence in particular, isn't working as expected when one simply
> reboots a node in an idle cluster.
>
> Since there is no dirty data, the resync finishes instantly and the
> unfence script is called, long before the just rebooted node has become a
> corosync and consequently a pacemaker cluster member again.
>
> To wit:
> ---
>
> Mar 27 10:57:33 mbx12 kernel: [ 21.177649] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> Mar 27 10:57:33 mbx12 kernel: [ 21.177653] block drbd1: updated UUIDs B48E58521182F05E:0000000000000000:2025C1AFB245A42C:2024C1AFB245A42C
> Mar 27 10:57:33 mbx12 kernel: [ 21.177658] block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
> Mar 27 10:57:33 mbx12 kernel: [ 21.177796] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
> Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: invoked for mb12
> Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Signon to CIB failed: Transport endpoint is not connected
> Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Init failed, could not perform requested operations
> Mar 27 10:57:33 mbx12 kernel: [ 21.211736] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 1 (0x100)

"crm-unfence-peer.sh" tries to remove a pacemaker constraint.

So don't start DRBD before you start Pacemaker.
Have Pacemaker start DRBD, then the CIB will be available for unfence.

> Mar 27 10:57:36 mbx12 systemd[1]: Starting Corosync Cluster Engine...
>
> Mar 27 10:57:38 mbx12 crmd[2300]: notice: The local CRM is operational
>
> ---
>
> Restarting pacemaker on the rebooted node _after_ corosync and pacemaker
> membership has been established "fixes" this of course.
> In the case of a real failure on a busy cluster the resync would have
> likely taken significantly more than 5 seconds and things would have
> worked as well.
>
> I'm wondering if others have seen this, if there are tuning or dependency
> settings for corosync I'm missing or if it's just a question of inserting
> a "long enough" sleep into the unfence script (ouch).

--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed
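
The advice in the reply (let Pacemaker bring up DRBD instead of starting it at boot) might look roughly like the sketch below. It is not taken from the thread: the DRBD resource name `r0` and the Pacemaker resource IDs `p_drbd_r0` and `ms_drbd_r0` are hypothetical placeholders, and it assumes the crm shell is installed and the `ocf:linbit:drbd` resource agent is available. Adjust names and intervals to the actual setup.

```shell
# Assumption: DRBD is currently started at boot by its init script/unit.
# Disable that, so DRBD only comes up once Pacemaker (and thus the CIB) is running.
systemctl disable drbd.service

# Assumption: the DRBD resource is called "r0" (hypothetical name).
# Define it as a Pacemaker-managed resource using the linbit OCF agent.
crm configure primitive p_drbd_r0 ocf:linbit:drbd \
    params drbd_resource=r0 \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave

# Run it as a master/slave set on both nodes; notify=true is needed by the agent.
crm configure ms ms_drbd_r0 p_drbd_r0 \
    meta master-max=1 master-node-max=1 \
         clone-max=2 clone-node-max=1 notify=true
```

With this ordering the resync, and therefore the `after-resync-target` handler that calls `crm-unfence-peer.sh`, can only happen after the local node has rejoined the cluster, so the handler should find a CIB to sign on to.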