Hello,

On Wed, 29 Mar 2017 12:01:43 +0200 Lars Ellenberg wrote:

> On Mon, Mar 27, 2017 at 11:47:52AM +0900, Christian Balzer wrote:
> >
> > Hello,
> >
> > Since Debian Jessie for some time now has pacemaker (again) in backports,
> > this is actually my first corosync+pacemaker+DRBD based cluster, plenty
> > of them based on Wheezy and thus heartbeat (and SysV Init).
> >
> > I'm not sure whether to blame systemd (see the startup time) or the rather
> > sedate nature of corosync when it comes to adding member nodes, but the CIB
> > fencing, unfence in particular, isn't working as expected when one simply
> > reboots a node in an idle cluster.
> >
> > Since there is no dirty data, the resync finishes instantly and the
> > unfence script is called, long before the just rebooted node has become a
> > corosync and consequently a pacemaker cluster member again.
> >
> > To wit:
> > ---
> >
> > Mar 27 10:57:33 mbx12 kernel: [ 21.177649] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> > Mar 27 10:57:33 mbx12 kernel: [ 21.177653] block drbd1: updated UUIDs B48E58521182F05E:0000000000000000:2025C1AFB245A42C:2024C1AFB245A42C
> > Mar 27 10:57:33 mbx12 kernel: [ 21.177658] block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
> > Mar 27 10:57:33 mbx12 kernel: [ 21.177796] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
> > Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: invoked for mb12
> > Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Signon to CIB failed: Transport endpoint is not connected
> > Mar 27 10:57:33 mbx12 crm-unfence-peer.sh[2151]: Init failed, could not perform requested operations
> > Mar 27 10:57:33 mbx12 kernel: [ 21.211736] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 1 (0x100)
>
> "crm-unfence-peer.sh" tries to remove a pacemaker constraint.
> So don't start DRBD before you start Pacemaker.
>
Quite aware of that, sysv dependencies handled that beautifully.

> Have Pacemaker start DRBD, then the CIB will be available for unfence.
>
Easier said than done, as Debian only supplies the LSB init script, which
a "systemctl disable" will NOT disable.
For now a brutal "exit 0" solves that, bug report to Debian in progress...

Christian

> > Mar 27 10:57:36 mbx12 systemd[1]: Starting Corosync Cluster Engine...
> >
> > Mar 27 10:57:38 mbx12 crmd[2300]: notice: The local CRM is operational
> >
> > ---
> >
> > Restarting pacemaker on the rebooted node _after_ corosync and pacemaker
> > membership has been established "fixes" this of course.
> > In the case of a real failure on a busy cluster the resync would have
> > likely taken significantly more than 5 seconds and things would have
> > worked as well.
> >
> > I'm wondering if others have seen this, if there are tuning or dependency
> > settings for corosync I'm missing or if it's just a question of inserting
> > a "long enough" sleep into the unfence script (ouch).
>
-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Rakuten Communications
http://www.gol.com/
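
For illustration only, the "long enough sleep" idea from the quoted message could be bounded rather than fixed: poll until the CIB answers, then give up after a deadline. The following is an untested sketch, not something from the thread. The function name `wait_until` is hypothetical, and whether blocking a DRBD handler for that long is acceptable on a given setup is an assumption that would need checking.

```shell
#!/bin/sh
# Hypothetical helper (not part of DRBD or Pacemaker):
# retry a command until it succeeds or a deadline passes.
#
#   wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
#
# Returns 0 as soon as COMMAND succeeds, 1 once TIMEOUT_SECONDS
# have been spent sleeping without a successful run.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    # Run the command quietly; only its exit status matters here.
    while ! "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

A wrapper configured as the after-resync-target handler might then call something like `wait_until 60 2 cibadmin --query` before exec'ing the real crm-unfence-peer.sh, on the assumption that `cibadmin --query` exits non-zero while the CIB is not yet reachable. This only papers over the ordering problem; having Pacemaker start DRBD, as suggested in the reply, avoids the race entirely.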