On Fri, Aug 22, 2008 at 04:06:51PM +0200, Gianluca Cecchi wrote:
> Hello all, my first post.
> It is quite long, but I try to give details.
>
> I have 2 x CentOS 5.2 servers with heartbeat+drbd,
> two eth channels and no stonith at the moment (suggestions in this
> respect are welcome, yet to analyze in depth).
> Package versions are:
> heartbeat-2.1.3-3.el5.centos
> drbd82-8.2.6-1.el5.centos
> kmod-drbd82-8.2.6-1.2.6.18_92.el5
> kernel-2.6.18-92.el5
>
> I intend to provide NFS services in HA and consulted many docs,
> arriving at a specific config (see below).
> At the moment I'm planning to provide only a primary/slave service on
> one NFS resource.
>
> I'm trying to consider and simulate various planned/unplanned
> scenarios, and at the moment I have the following, with doubts about
> the heartbeat and/or drbd config and the resulting "service" behaviour.
> Excuse me if this is off topic in case it is a misconfiguration of
> heartbeat.
>
> I have the cluster running:
> nfsnode1 active and master
> nfsnode2 active and slave
>
> The heartbeat resource is:
> nfsnode2 drbddisk::drbd-resource-0 \
>     Filesystem::/dev/drbd0::/drbd0::ext3 \
>     killnfsd \
>     nfslock \
>     nfs \
>     Delay::3::0 \
>     IPaddr::10.4.5.103/24/eth0

A heartbeat v1 (haresources) style config cannot cope with your
described situation.

> drbd status on nfsnode2 is and remains:
> 0:drbd-resource-0  WFConnection  Secondary/Unknown  Outdated/DUnknown  C
>
> and in the log, 6 times:
> Aug 22 12:57:37 nfsnode2 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
> Aug 22 12:57:37 nfsnode2 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown r--- }
> Aug 22 12:57:37 nfsnode2 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Outdated/DUnknown r--- }
> ...
> Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2683]: ERROR: Return code 1 from /etc/ha.d/resource.d/drbddisk
> Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2684]: CRIT: Giving up resources due to failure of drbddisk::drbd-resource-0
> Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2685]: info: Releasing resource group: nfsnode2 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/drbd0::ext3 killnfsd nfslock nfs Delay::3::0 IPaddr::10.4.5.103/24/eth0
> ...
> Aug 22 12:58:13 nfsnode2 hb_standby[3047]: [3053]: Going standby [foreign].
> Aug 22 12:58:14 nfsnode2 heartbeat: [2204]: info: nfsnode2 wants to go standby [foreign]
> Aug 22 12:58:24 nfsnode2 heartbeat: [2204]: WARN: No reply to standby request. Standby request cancelled.

nfsnode2 is the "home node", but could not start the resources.
Intentionally so: its DRBD was Outdated, and DRBD correctly refuses to
become Primary without at least one UpToDate disk.

> Now start up nfsnode1 (step 6):
> nfsnode1 correctly executes a sync of the drbd data towards nfsnode2,
> which was not aligned,
> and so heartbeat doesn't activate the nfs service and the corresponding
> virtual IP.

Right. nfsnode1 is not the home node, and apparently concludes from
seeing the home node alive that the resources are running there
alright, so it takes no action.
Apparently heartbeat in haresources (non-crm) mode cannot cope with
your scenario.
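(Side note: the Outdated/DUnknown you see up there is dopd doing its
job -- your nfsnode1 log below shows dopd being started. Just for
reference, the drbd.conf side of an 8.2 + dopd setup usually looks
something like the sketch below; I'm guessing at your actual resource
section, so treat it as a sketch and adapt names and paths:

  resource drbd-resource-0 {
    disk {
      # if the peer becomes unreachable, don't just carry on:
      # try to mark its data as Outdated first
      fencing resource-only;
    }
    handlers {
      # the helper that asks dopd (over the heartbeat comm links) to
      # set the Outdated flag on the peer; path as shipped with the
      # CentOS heartbeat/drbd82 packages, check where it lives on
      # your boxes
      outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
    }
    ...
  }

That Outdated mark is exactly why DRBD refused to let nfsnode2 become
Primary on stale data -- which is a feature, not a bug.)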
> heartbeat logs give:
> Aug 22 13:06:12 nfsnode1 heartbeat: [2040]: info: Local status now set to: 'up'
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Link nfsnode2:eth0 up.
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Status update for node nfsnode2: status active
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 100 ms (> 50 ms) (GSource: 0x83b8940)
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Link nfsnode2:eth1 up.
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Comm_now_up(): updating status to active
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Local status now set to: 'active'
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Starting child client "/usr/lib/heartbeat/dopd" (498,496)
> Aug 22 13:06:13 nfsnode1 heartbeat: [2188]: info: Starting "/usr/lib/heartbeat/dopd" as uid 498 gid 496 (pid 2188)
> Aug 22 13:06:13 nfsnode1 harc[2186]: [2196]: info: Running /etc/ha.d/rc.d/status status
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: remote resource transition completed.
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: remote resource transition completed.
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Local Resource acquisition completed. (none)
> Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Initial resource acquisition complete (T_RESOURCES(them))
>
> Any hints and suggestions?
> Thanks in advance

If you want to cope with multiple failures, operator intervention is
almost always required.

In this scenario, if you want to stick with the haresources style
heartbeat config, you probably have to intervene by hand; something like

  # /usr/lib/heartbeat/ResourceManager takegroup drbddisk::drbd-resource-0

could do the trick now. (But don't tell anybody.)

It may even be a heartbeat "bug", but I doubt the linux-ha guys are
eager to change anything in the officially unmaintained non-crm code.

-- 
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of
LINBIT Information Technologies GmbH

__
please don't Cc me, but send to list -- I'm subscribed
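P.S.: to spell out the manual recovery on the home node, something like
the following should do (untested from here, adapt device and resource
names to yours).

On nfsnode2, wait for the resync to finish; /proc/drbd should show
ds:UpToDate/UpToDate for device 0:

  # cat /proc/drbd

then take back the whole resource group:

  # /usr/lib/heartbeat/ResourceManager takegroup drbddisk::drbd-resource-0

That should promote drbd0, mount /drbd0, restart the nfs bits and bring
the 10.4.5.103 service IP back, in the order given in your haresources
line.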