On Fri, Aug 22, 2008 at 04:06:51PM +0200, Gianluca Cecchi wrote:
> Hello all, my first post.
> It is quite long, but I try to give details.
>
> I have 2 x CentOS 5.2 servers with heartbeat+drbd,
> two eth channels and no stonith at the moment (suggestions in this
> respect are welcome, yet to analyze in depth).
> Package versions are:
> heartbeat-2.1.3-3.el5.centos
> drbd82-8.2.6-1.el5.centos
> kmod-drbd82-8.2.6-1.2.6.18_92.el5
> kernel-2.6.18-92.el5
>
> I intend to provide NFS services in HA and consulted many docs,
> arriving at a specific config (see below).
> At the moment I'm planning to provide only a primary/slave service on
> one NFS resource.
>
> I'm trying to consider and simulate various planned/unplanned
> scenarios, and at the moment I have the following, with doubts about
> the heartbeat and/or drbd config and the resulting "service" behaviour.
> Excuse me if this is off topic in case it is a misconfiguration of
> heartbeat.
>
> I have the cluster running:
> nfsnode1 active and master
> nfsnode2 active and slave
>
> The heartbeat resource is:
> nfsnode2 drbddisk::drbd-resource-0 \
>     Filesystem::/dev/drbd0::/drbd0::ext3 \
>     killnfsd \
>     nfslock \
>     nfs \
>     Delay::3::0 \
>     IPaddr::10.4.5.103/24/eth0

A heartbeat v1 (haresources) style config cannot cope with your
described situation.

> drbd status on nfsnode2 is and remains:
> 0:drbd-resource-0  WFConnection  Secondary/Unknown  Outdated/DUnknown  C
>
> and in the log, 6 times:
> Aug 22 12:57:37 nfsnode2 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
> Aug 22 12:57:37 nfsnode2 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown r--- }
> Aug 22 12:57:37 nfsnode2 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Outdated/DUnknown r--- }
> ...
> Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2683]: ERROR: Return code 1 from /etc/ha.d/resource.d/drbddisk
> Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2684]: CRIT: Giving up resources due to failure of drbddisk::drbd-resource-0
> Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2685]: info: Releasing resource group: nfsnode2 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/drbd0::ext3 killnfsd nfslock nfs Delay::3::0 IPaddr::10.4.5.103/24/eth0
> ...
> Aug 22 12:58:13 nfsnode2 hb_standby[3047]: [3053]: Going standby [foreign].
> Aug 22 12:58:14 nfsnode2 heartbeat: [2204]: info: nfsnode2 wants to go standby [foreign]
> Aug 22 12:58:24 nfsnode2 heartbeat: [2204]: WARN: No reply to standby request. Standby request cancelled.

nfsnode2 is the "home node", but could not start the resources.
Intentionally so: its DRBD was Outdated, and DRBD correctly refuses to
become Primary without at least one UpToDate disk.

> Now start up nfsnode1 (step 6):
> nfsnode1 correctly executes a sync of the drbd data towards nfsnode2,
> which was not aligned,
> and so heartbeat doesn't activate the nfs service and the corresponding
> virtual IP.

Right. nfsnode1 is not the home node, and apparently concludes from
seeing the home node alive that the resources are running there
alright, so it takes no action.
Apparently heartbeat in haresources (non-crm) mode cannot cope with
your scenario.
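(Side note: the Outdated/DUnknown you see up there is dopd doing its
job -- your nfsnode1 log below shows dopd being started. Just for
reference, the drbd.conf side of an 8.2 + dopd setup usually looks
something like the sketch below; I'm guessing at your actual resource
section, so treat it as a sketch and adapt names and paths:

  resource drbd-resource-0 {
    disk {
      # if the peer becomes unreachable, don't just carry on:
      # try to mark its data as Outdated first
      fencing resource-only;
    }
    handlers {
      # the helper that asks dopd (over the heartbeat comm links) to
      # set the Outdated flag on the peer; path as shipped with the
      # CentOS heartbeat/drbd82 packages, check where it lives on
      # your boxes
      outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
    }
    ...
  }

That Outdated mark is exactly why DRBD refused to let nfsnode2 become
Primary on stale data -- which is a feature, not a bug.)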
> heartbeat logs give:
> Aug 22 13:06:12 nfsnode1 heartbeat: [2040]: info: Local status now set to: 'up'
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Link nfsnode2:eth0 up.
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Status update for node nfsnode2: status active
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 100 ms (> 50 ms) (GSource: 0x83b8940)
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Link nfsnode2:eth1 up.
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Comm_now_up(): updating status to active
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Local status now set to: 'active'
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Starting child client "/usr/lib/heartbeat/dopd" (498,496)
> Aug 22 13:06:13 nfsnode1 heartbeat: [2188]: info: Starting "/usr/lib/heartbeat/dopd" as uid 498 gid 496 (pid 2188)
> Aug 22 13:06:13 nfsnode1 harc[2186]: [2196]: info: Running /etc/ha.d/rc.d/status status
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: remote resource transition completed.
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: remote resource transition completed.
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Local Resource acquisition completed. (none)
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Initial resource acquisition complete (T_RESOURCES(them))
>
> Any hints and suggestions?
> Thanks in advance

If you want to cope with multiple failures, operator intervention is
almost always required.

In this scenario, if you want to stick with the haresources style
heartbeat config, you probably have to intervene by hand; something like

  # /usr/lib/heartbeat/ResourceManager takegroup drbddisk::drbd-resource-0

could do the trick now. (But don't tell anybody.)

It may even be a heartbeat "bug", but I doubt the linux-ha guys are
eager to change anything in the officially unmaintained non-crm code.

-- 
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of
LINBIT Information Technologies GmbH

__
please don't Cc me, but send to list -- I'm subscribed
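P.S.: to spell out the manual recovery on the home node, something like
the following should do (untested from here, adapt device and resource
names to yours).

On nfsnode2, wait for the resync to finish; /proc/drbd should show
ds:UpToDate/UpToDate for device 0:

  # cat /proc/drbd

then take back the whole resource group:

  # /usr/lib/heartbeat/ResourceManager takegroup drbddisk::drbd-resource-0

That should promote drbd0, mount /drbd0, restart the nfs bits and bring
the 10.4.5.103 service IP back, in the order given in your haresources
line.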