Before I start trying the patch thing (and since this is an rpm and I don't know where to get the source or how to apply a patch) I've been continuing down the permission problem path. I'll admit I'm not great with complicated permissions, but this seems to get farther the more I goof around with the permissions and ownership. Once I put the "others" execute permission back on drbdmeta and drbdsetup, the outdate feature started working when I ran it manually from node1. I'm guessing that's because dopd is running as hacluster and the group permissions are for haclient.

However, now I've got a new problem. drbddisk gives a critical error, won't take over the resource, and eventually gives up. node2 does outdate the peer but won't take over as primary. Here is what cat /proc/drbd shows when the dust settles:

SVN Revision: 3048 build by phil@mescal, 2007-09-03 10:39:27
 0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/Outdated C r---
    ns:2 nr:139737 dw:139739 dr:0 al:0 bm:40 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:8692 misses:10 starving:0 dirty:0 changed:10
        act_log: used:0/257 hits:4 misses:0 starving:0 dirty:0 changed:0

You can see that node2 has the peer outdated, which is correct, but it isn't able to become primary so that heartbeat can use it. Logs, config files and file permissions to follow:

I kill the power on node1 (primary) and node2 gets this error:

Dec 13 13:08:17 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/drbddisk home start
Dec 13 13:08:17 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/drbddisk home start
Dec 13 13:08:17 svr92 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Dec 13 13:08:17 svr92 ipfail: [9068]: debug: Found ping node 192.168.151.1!
Dec 13 13:08:18 svr92 ipfail: [9068]: info: NS: We are still alive!
Dec 13 13:08:19 svr92 /usr/lib/heartbeat/dopd: [9069]: info: send_message_to_the_peer: sending start_outdate message to the other node svr92 -> svr91
Dec 13 13:08:36 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/drbddisk home start done. RC=20
Dec 13 13:08:36 svr92 ResourceManager[9144]: ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
Dec 13 13:08:36 svr92 ResourceManager[9144]: CRIT: Giving up resources due to failure of drbddisk::home
Dec 13 13:08:36 svr92 ResourceManager[9144]: info: Releasing resource group: svr91 IPaddr::192.168.151.90/24/eth0 drbddisk::home Filesystem::/dev/drbd0::/home::xfs
Dec 13 13:08:36 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs stop
Dec 13 13:08:36 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs stop
Dec 13 13:08:36 svr92 Filesystem[9431]: INFO: Running stop for /dev/drbd0 on /home
Dec 13 13:08:36 svr92 Filesystem[9428]: INFO: Success
Dec 13 13:08:36 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs stop done. RC=0
Dec 13 13:08:36 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/drbddisk home stop
Dec 13 13:08:36 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/drbddisk home stop
Dec 13 13:08:41 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/drbddisk home stop done. RC=20
Dec 13 13:08:41 svr92 ResourceManager[9144]: ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
Dec 13 13:08:42 svr92 ResourceManager[9144]: info: Retrying failed stop operation [drbddisk::home]
Dec 13 13:08:42 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/drbddisk home stop
Dec 13 13:08:42 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/drbddisk home stop
Dec 13 13:08:43 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/drbddisk home stop done. RC=20
Dec 13 13:08:43 svr92 ResourceManager[9144]: ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
[repeats a bunch]
Dec 13 13:09:15 svr92 ResourceManager[9144]: ERROR: Resource script for drbddisk::home probably not LSB-compliant.
Dec 13 13:09:15 svr92 ResourceManager[9144]: WARN: it (drbddisk::home) MUST succeed on a stop when already stopped
Dec 13 13:09:15 svr92 ResourceManager[9144]: WARN: Machine reboot narrowly avoided!
Dec 13 13:09:15 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.151.90/24/eth0 stop
Dec 13 13:09:15 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/IPaddr 192.168.151.90/24/eth0 stop
Dec 13 13:09:15 svr92 IPaddr[9959]: INFO: /sbin/ifconfig eth0:0 192.168.151.90 down
Dec 13 13:09:15 svr92 IPaddr[9938]: INFO: Success
Dec 13 13:09:15 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/IPaddr 192.168.151.90/24/eth0 stop done. RC=0
Dec 13 13:09:15 svr92 mach_down[9124]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Dec 13 13:09:16 svr92 mach_down[9124]: info: mach_down takeover complete for node svr91.
Dec 13 13:09:16 svr92 heartbeat: [9040]: info: mach_down takeover complete.
Dec 13 13:09:17 svr92 kernel: drbd0: outdate-peer helper returned 5
Dec 13 13:09:17 svr92 kernel: drbd0: role( Secondary -> Primary ) pdsk( DUnknown -> Outdated )
Dec 13 13:09:17 svr92 kernel: drbd0: Creating new current UUID
Dec 13 13:09:17 svr92 kernel: drbd0: Writing meta data super block now.
Dec 13 13:09:17 svr92 kernel: drbd0: role( Primary -> Secondary )
Dec 13 13:09:17 svr92 kernel: drbd0: Writing meta data super block now.
Dec 13 13:09:46 svr92 hb_standby[10007]: Going standby [foreign].
Dec 13 13:09:46 svr92 heartbeat: [9040]: info: svr92 wants to go standby [foreign]
Dec 13 13:09:56 svr92 heartbeat: [9040]: WARN: No reply to standby request. Standby request cancelled.
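
For anyone skimming a heartbeat log like this one, a small filter can pull out which resource scripts failed and how often. This is only a rough sketch: the regular expression is based on the "Return code N from PATH" wording in the ResourceManager messages here, and the helper name failed_scripts is made up for illustration.

```python
import re
from collections import Counter

# Matches the ResourceManager error wording seen in this log, e.g.
# "ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk".
RC_LINE = re.compile(r"ERROR: Return code (\d+) from (\S+)")


def failed_scripts(log_text):
    """Count (script path, return code) pairs reported as errors."""
    counts = Counter()
    for match in RC_LINE.finditer(log_text):
        rc, script = match.groups()
        counts[(script, int(rc))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Dec 13 13:08:36 svr92 ResourceManager[9144]: ERROR: Return code 20 "
        "from /etc/ha.d/resource.d/drbddisk\n"
        "Dec 13 13:08:41 svr92 ResourceManager[9144]: ERROR: Return code 20 "
        "from /etc/ha.d/resource.d/drbddisk\n"
    )
    for (script, rc), n in failed_scripts(sample).items():
        print(f"{script} returned {rc} ({n} times)")
```

It only counts the ERROR lines; the "done. RC=0" debug lines for Filesystem and IPaddr are deliberately ignored.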

[root@svr92 sbin]# ll /sbin/drbd* /usr/sbin/drbd*
lrwxrwxrwx 1 root root        17 2007-12-13 10:16 /sbin/drbdadm -> /usr/sbin/drbdadm*
lrwxrwxrwx 1 root root        18 2007-12-13 10:17 /sbin/drbdmeta -> /usr/sbin/drbdmeta*
lrwxrwxrwx 1 root root        19 2007-12-13 10:17 /sbin/drbdsetup -> /usr/sbin/drbdsetup*
-rwxr-xr-x 1 root root     70088 2007-09-06 04:05 /usr/sbin/drbdadm*
-rwsr-xr-x 1 root haclient 47840 2007-09-06 04:05 /usr/sbin/drbdmeta*
-rwsr-xr-x 1 root haclient 33804 2007-09-06 04:05 /usr/sbin/drbdsetup*

drbd.conf:

global { usage-count no; }
common {
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -p";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -p";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -p";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
  }
  startup {
    degr-wfc-timeout 120; # 2 minutes.
  }
  disk {
    on-io-error detach;
    fencing resource-only;
  }
  net {
    cram-hmac-alg "sha1";
    shared-secret "[deleted]";
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  syncer {
    rate 10M;
    al-extents 257;
  }
}
resource home {
  protocol C;
  on svr91 {
    device /dev/drbd0;
    disk /dev/vg0/home;
    address 192.168.1.91:7788;
    meta-disk internal;
  }
  on svr92 {
    device /dev/drbd0;
    disk /dev/vg0/home;
    address 192.168.1.92:7788;
    meta-disk internal;
  }
}

ha.cf:

auto_failback off
logfacility local0
debugfile /var/log/ha-debug
keepalive 2
warntime 4
deadtime 12
deadping 6
initdead 30
baud 115200
serial /dev/ttyS0
ucast eth0 192.168.151.91 192.168.151.92
ucast eth1 192.168.1.91 192.168.1.92
node svr91 svr92
ping 192.168.151.1
ping 192.168.1.3
respawn hacluster /usr/lib/heartbeat/ipfail
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
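
Since the permission bits on drbdmeta and drbdsetup are what I keep fiddling with, here is a minimal sketch for decoding a mode string as printed by ls -l, to check at a glance whether a binary is setuid and whether "others" can execute it. It works on the text only and does not touch the real files; describe_mode is a hypothetical helper name, not part of DRBD or heartbeat.

```python
def describe_mode(mode):
    """Interpret a 10-character ls mode string such as '-rwsr-xr-x'."""
    if len(mode) != 10:
        raise ValueError("expected a 10-character mode string")
    return {
        # 's' or 'S' in the owner-execute slot means the setuid bit is set.
        "setuid": mode[3] in "sS",
        # 'x' or 's' in the group-execute slot means the group may execute.
        "group_exec": mode[6] in "xs",
        # 'x' or 't' in the last slot means everyone else may execute.
        "other_exec": mode[9] in "xt",
    }


if __name__ == "__main__":
    # Mode of /usr/sbin/drbdsetup from the listing in this post.
    print(describe_mode("-rwsr-xr-x"))
    # What it would look like with "others" execute removed.
    print(describe_mode("-rwsr-x---"))
```

With the listing in this post, both drbdmeta and drbdsetup come out as setuid with "others" execute enabled, which is the state in which the manual outdate started working.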