[DRBD-user] dopd failover

Rois Cannon rois at cobiz.com
Thu Dec 13 22:35:14 CET 2007


Before I try the patch (this is an rpm install, and I don't know where to
get the source or how to apply a patch), I've kept working on the
permissions problem.

I'll admit I'm not great with complicated permissions, but I seem to get
further the more I experiment with the permissions and ownership.

Once I put the "others" execute permissions back on drbdmeta and drbdsetup the
outdate feature started working when I ran it manually from node1.

I'm guessing that's because dopd is running as hacluster and the group
permissions are for haclient.

However, now I've got a new problem. drbddisk gives a critical error, won't
take over the resource, and eventually gives up. node2 does outdate the peer
but won't take over as primary. Here is what cat /proc/drbd shows when the
dust settles:
SVN Revision: 3048 build by phil@mescal, 2007-09-03 10:39:27
 0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/Outdated C r---
    ns:2 nr:139737 dw:139739 dr:0 al:0 bm:40 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:8692 misses:10 starving:0 dirty:0 changed:10
        act_log: used:0/257 hits:4 misses:0 starving:0 dirty:0 changed:0

You can see that node2 has the peer outdated, which is correct, but it
isn't able to become primary so that heartbeat can use it.
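
For anyone who wants to check this state from a script, here is a minimal sketch that splits the device line from /proc/drbd into its fields. The function name parse_drbd_status and the dictionary keys are made up for illustration; the only input it assumes is the cs:/st:/ds: layout shown above.

```python
import re

def parse_drbd_status(line):
    """Split one /proc/drbd device line into connection state, roles and disk states.

    Hypothetical helper: assumes the 'cs:.. st:local/peer ds:local/peer' layout
    seen in the output quoted in this post.
    """
    # Collect key:value tokens; the leading "0:" has no value and is skipped.
    fields = dict(re.findall(r"(\w+):(\S+)", line))
    local_role, peer_role = fields["st"].split("/")
    local_disk, peer_disk = fields["ds"].split("/")
    return {
        "connection": fields["cs"],
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }

status = parse_drbd_status(
    " 0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/Outdated C r---"
)

# The situation described above: peer outdated, local data good,
# but this node has not been promoted.
stuck = (
    status["peer_disk"] == "Outdated"
    and status["local_disk"] == "UpToDate"
    and status["local_role"] != "Primary"
)
print(status)
print("stuck as secondary:", stuck)
```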

Logs, config files and file permissions follow.

I kill the power on node1 (primary) and node2 gets this error:
Dec 13 13:08:17 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/drbddisk home start
Dec 13 13:08:17 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/drbddisk home start
Dec 13 13:08:17 svr92 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Dec 13 13:08:17 svr92 ipfail: [9068]: debug: Found ping node 192.168.151.1!
Dec 13 13:08:18 svr92 ipfail: [9068]: info: NS: We are still alive!
Dec 13 13:08:19 svr92 /usr/lib/heartbeat/dopd: [9069]: info: send_message_to_the_peer: sending start_outdate message to the other node svr92 -> svr91
Dec 13 13:08:36 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/drbddisk home start done. RC=20
Dec 13 13:08:36 svr92 ResourceManager[9144]: ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
Dec 13 13:08:36 svr92 ResourceManager[9144]: CRIT: Giving up resources due to failure of drbddisk::home
Dec 13 13:08:36 svr92 ResourceManager[9144]: info: Releasing resource group: svr91 IPaddr::192.168.151.90/24/eth0 drbddisk::home Filesystem::/dev/drbd0::/home::xfs
Dec 13 13:08:36 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs stop
Dec 13 13:08:36 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs stop
Dec 13 13:08:36 svr92 Filesystem[9431]: INFO: Running stop for /dev/drbd0 on /home
Dec 13 13:08:36 svr92 Filesystem[9428]: INFO:  Success
Dec 13 13:08:36 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs stop done. RC=0
Dec 13 13:08:36 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/drbddisk home stop
Dec 13 13:08:36 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/drbddisk home stop
Dec 13 13:08:41 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/drbddisk home stop done. RC=20
Dec 13 13:08:41 svr92 ResourceManager[9144]: ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
Dec 13 13:08:42 svr92 ResourceManager[9144]: info: Retrying failed stop operation [drbddisk::home]
Dec 13 13:08:42 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/drbddisk home stop
Dec 13 13:08:42 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/drbddisk home stop
Dec 13 13:08:43 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/drbddisk home stop done. RC=20
Dec 13 13:08:43 svr92 ResourceManager[9144]: ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk

[repeats a bunch]

Dec 13 13:09:15 svr92 ResourceManager[9144]: ERROR: Resource script for drbddisk::home probably not LSB-compliant.
Dec 13 13:09:15 svr92 ResourceManager[9144]: WARN: it (drbddisk::home) MUST succeed on a stop when already stopped
Dec 13 13:09:15 svr92 ResourceManager[9144]: WARN: Machine reboot narrowly avoided!
Dec 13 13:09:15 svr92 ResourceManager[9144]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.151.90/24/eth0 stop
Dec 13 13:09:15 svr92 ResourceManager[9144]: debug: Starting /etc/ha.d/resource.d/IPaddr 192.168.151.90/24/eth0 stop
Dec 13 13:09:15 svr92 IPaddr[9959]: INFO: /sbin/ifconfig eth0:0 192.168.151.90 down
Dec 13 13:09:15 svr92 IPaddr[9938]: INFO:  Success
Dec 13 13:09:15 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/IPaddr 192.168.151.90/24/eth0 stop done. RC=0
Dec 13 13:09:15 svr92 mach_down[9124]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Dec 13 13:09:16 svr92 mach_down[9124]: info: mach_down takeover complete for node svr91.
Dec 13 13:09:16 svr92 heartbeat: [9040]: info: mach_down takeover complete.
Dec 13 13:09:17 svr92 kernel: drbd0: outdate-peer helper returned 5
Dec 13 13:09:17 svr92 kernel: drbd0: role( Secondary -> Primary ) pdsk( DUnknown -> Outdated )
Dec 13 13:09:17 svr92 kernel: drbd0: Creating new current UUID
Dec 13 13:09:17 svr92 kernel: drbd0: Writing meta data super block now.
Dec 13 13:09:17 svr92 kernel: drbd0: role( Primary -> Secondary )
Dec 13 13:09:17 svr92 kernel: drbd0: Writing meta data super block now.
Dec 13 13:09:46 svr92 hb_standby[10007]: Going standby [foreign].
Dec 13 13:09:46 svr92 heartbeat: [9040]: info: svr92 wants to go standby [foreign]
Dec 13 13:09:56 svr92 heartbeat: [9040]: WARN: No reply to standby request.  Standby request cancelled.
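
To make the timing in this log easier to see, here is a small sketch that measures the gap between three of the lines above: the kernel calling the outdate-peer helper, drbddisk start finishing with RC=20, and the kernel reporting that the helper returned. The three log lines are copied from above; the function names stamp and seconds_between are made up for illustration. It only reads timestamps and does not by itself show why drbddisk failed.

```python
from datetime import datetime

# Three lines copied verbatim from the log above.
LOG = """\
Dec 13 13:08:17 svr92 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Dec 13 13:08:36 svr92 ResourceManager[9144]: debug: /etc/ha.d/resource.d/drbddisk home start done. RC=20
Dec 13 13:09:17 svr92 kernel: drbd0: outdate-peer helper returned 5
"""

def stamp(line):
    """Parse the HH:MM:SS field (third whitespace-separated token) of a syslog line."""
    return datetime.strptime(line.split()[2], "%H:%M:%S")

def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first containing end_marker."""
    start = next(l for l in lines if start_marker in l)
    end = next(l for l in lines if end_marker in l)
    return int((stamp(end) - stamp(start)).total_seconds())

lines = LOG.splitlines()
helper_secs = seconds_between(lines, "helper command", "helper returned")
start_secs = seconds_between(lines, "helper command", "home start done")

print("helper call to helper return:", helper_secs, "s")
print("helper call to drbddisk start giving up:", start_secs, "s")
```

Going by the timestamps alone, drbddisk start finished with RC=20 well before the kernel logged the helper's return, and the short-lived Secondary -> Primary transition only shows up after that.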


[root@svr92 sbin]# ll /sbin/drbd* /usr/sbin/drbd*
lrwxrwxrwx 1 root root        17 2007-12-13 10:16 /sbin/drbdadm -> /usr/sbin/drbdadm*
lrwxrwxrwx 1 root root        18 2007-12-13 10:17 /sbin/drbdmeta -> /usr/sbin/drbdmeta*
lrwxrwxrwx 1 root root        19 2007-12-13 10:17 /sbin/drbdsetup -> /usr/sbin/drbdsetup*
-rwxr-xr-x 1 root root     70088 2007-09-06 04:05 /usr/sbin/drbdadm*
-rwsr-xr-x 1 root haclient 47840 2007-09-06 04:05 /usr/sbin/drbdmeta*
-rwsr-xr-x 1 root haclient 33804 2007-09-06 04:05 /usr/sbin/drbdsetup*
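
As a way of reading the mode strings in that listing, here is a minimal sketch that reports which permission classes (owner, group, others) may execute a file and whether the setuid bit is set. The function names are made up for illustration. The second mode string, '-rwsr-x---', is an assumed example of a binary without the "others" execute bit; the actual modes from before the change are not shown in this post.

```python
def exec_classes(mode_string):
    """Return the permission classes allowed to execute, given an
    ls -l style mode string such as '-rwsr-xr-x'."""
    perms = mode_string[1:10]              # drop the leading file-type character
    allowed = []
    for i, name in enumerate(("owner", "group", "others")):
        flag = perms[i * 3 + 2]            # third character of each rwx triplet
        # 'x' is plain execute; lowercase 's' or 't' means execute plus
        # setuid/setgid/sticky. Uppercase 'S'/'T' means the special bit
        # is set without execute.
        if flag in "xst":
            allowed.append(name)
    return allowed

def has_setuid(mode_string):
    """True if the owner triplet carries the setuid bit ('s' or 'S')."""
    return mode_string[3] in "sS"

current = "-rwsr-xr-x"      # drbdmeta / drbdsetup as listed above
restricted = "-rwsr-x---"   # assumed example: no execute for "others"

print(exec_classes(current), has_setuid(current))
print(exec_classes(restricted), has_setuid(restricted))
```

With the restricted example, a process that is neither root nor running with the haclient group would not be able to execute the binary at all, which would fit the behaviour seen before the "others" execute bit was put back.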

drbd.conf:
global {
    usage-count no;
}
common {
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -p";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -p";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -p";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
  }
  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }
  disk {
    on-io-error   detach;
    fencing resource-only;
  }
  net {
    cram-hmac-alg "sha1";
    shared-secret "[deleted]";
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  syncer {
    rate 10M;
    al-extents 257;
  }
}
resource home {
  protocol C;
  on svr91 {
    device     /dev/drbd0;
    disk       /dev/vg0/home;
    address    192.168.1.91:7788;
    meta-disk  internal;
  }
  on svr92 {
    device     /dev/drbd0;
    disk       /dev/vg0/home;
    address    192.168.1.92:7788;
    meta-disk  internal;
  }
}

ha.cf:
auto_failback off
logfacility     local0
debugfile /var/log/ha-debug
keepalive 2
warntime 4
deadtime 12
deadping 6
initdead 30
baud 115200
serial /dev/ttyS0
ucast eth0 192.168.151.91 192.168.151.92
ucast eth1 192.168.1.91 192.168.1.92
node svr91 svr92
ping 192.168.151.1
ping 192.168.1.3
respawn hacluster /usr/lib/heartbeat/ipfail
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
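
Since the guess about dopd running as hacluster comes from the respawn lines above, here is a small sketch that pulls the user for each respawned program out of ha.cf text. The function name respawn_users is made up for illustration, and it assumes the "respawn <user> <path>" layout used in this file.

```python
def respawn_users(ha_cf_text):
    """Map each respawned program (basename) to the user it is started as.

    Assumes lines of the form: respawn <user> <path-to-program>
    """
    users = {}
    for line in ha_cf_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "respawn":
            program = parts[2].rsplit("/", 1)[-1]   # keep only the file name
            users[program] = parts[1]
    return users

# The relevant lines copied from the ha.cf above.
HA_CF = """\
respawn hacluster /usr/lib/heartbeat/ipfail
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
"""

print(respawn_users(HA_CF))
```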




