Hello all, my first post. It is quite long, but I'll try to give details.

I have two CentOS 5.2 servers with heartbeat + drbd, two eth channels and no STONITH at the moment (suggestions in this respect are welcome; I have yet to analyze it in depth). Package versions are:

  heartbeat-2.1.3-3.el5.centos
  drbd82-8.2.6-1.el5.centos
  kmod-drbd82-8.2.6-1.2.6.18_92.el5
  kernel-2.6.18-92.el5

I intend to provide NFS services in HA and have consulted many docs, arriving at a specific config (see below). At the moment I'm planning to provide only a primary/secondary service on one NFS resource. I'm trying to consider and simulate various planned/unplanned scenarios, and right now I have the behaviour described below, with doubts about the heartbeat and/or drbd config and the resulting "service" behaviour. Excuse me if this turns out to be off topic because of a heartbeat misconfiguration.

I have the cluster running, nfsnode1 active and master, nfsnode2 active and slave. The heartbeat resource line (haresources) is:

  nfsnode2 drbddisk::drbd-resource-0 \
           Filesystem::/dev/drbd0::/drbd0::ext3 \
           killnfsd \
           nfslock \
           nfs \
           Delay::3::0 \
           IPaddr::10.4.5.103/24/eth0

and ha.cf is:

  keepalive 1
  deadtime 10
  warntime 2
  ucast eth0 10.4.5.102
  ucast eth1 10.4.192.242
  auto_failback off
  node nfsnode1
  node nfsnode2
  respawn hacluster /usr/lib/heartbeat/dopd
  apiauth dopd gid=haclient uid=hacluster
  use_logd yes

Actions:

1) NFS service provided, write operations from clients active on the drbd device
2) shutdown of nfsnode2 (so heartbeat and drbd stop cleanly)
3) write operations continue against the drbd device (on nfsnode1) while nfsnode2 is powered off
4) shutdown of nfsnode1 (so heartbeat and drbd stop cleanly)
5) restart of nfsnode2 with nfsnode1 still powered down (suppose, for example, a wrong action done by an operator during maintenance activities...)
   ===> I think it should not start the services, as it was slave when shut down, and indeed it doesn't. Good.
6) start of nfsnode1
   ===> I would now expect nfsnode1 to carry on the service, as it was the latest master while the other was slave, and both shutdown operations were clean; and in fact the sync correctly happens between the two. But both drbd resources remain Secondary... Bad (in my opinion):

     0:drbd-resource-0  Connected  Secondary/Secondary  UpToDate/UpToDate  C

   so the heartbeat chain doesn't start and the NFS service is not provided.
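If I understand the tools correctly, the manual way out of this Secondary/Secondary state would be something along these lines (just a sketch, untested; resource name taken from my config above):

  # on nfsnode1, promote the resource again so the heartbeat chain can run:
  drbdadm primary drbd-resource-0
  # then ask heartbeat to (re)acquire the resource group:
  /usr/lib/heartbeat/hb_takeover

but of course I would like the cluster to reach this state by itself, given that both shutdowns were clean.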
My drbd.conf at the moment is:

  resource "drbd-resource-0" {
    protocol C;

    handlers {
      pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
      pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
      local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
      outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }

    startup {
      # wfc-timeout  0;      ## Infinite!
      degr-wfc-timeout 120;  ## 2 minutes.
    }

    disk {
      on-io-error detach;
      fencing resource-only;
    }

    net {
      after-sb-0pri disconnect;
      after-sb-1pri disconnect;
      after-sb-2pri disconnect;
      rr-conflict disconnect;
    }

    syncer {
      rate 60M;
      al-extents 257;
    }

    # It is valid to move device, disk and meta-disk to the
    # resource level.
    device    /dev/drbd0;
    disk      /dev/sdb1;
    meta-disk internal;

    on nfsnode1 {
      address 10.4.192.241:7789;
    }
    on nfsnode2 {
      address 10.4.192.242:7789;
    }
  }

Some log data. At the shutdown of nfsnode2 in step 2), nfsnode1 becomes:

  0:drbd-resource-0  WFConnection  Primary/Unknown  UpToDate/Outdated  C  /drbd0

and in messages:

  Aug 22 12:42:04 nfsnode1 kernel: drbd0: State change failed: Refusing to be Primary while peer is not outdated
  Aug 22 12:42:04 nfsnode1 kernel: drbd0: state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- }
  Aug 22 12:42:04 nfsnode1 kernel: drbd0: wanted = { cs:TearDown st:Primary/Unknown ds:UpToDate/DUnknown r--- }
  Aug 22 12:42:04 nfsnode1 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )

At the restart of nfsnode2 in step 5), it waits forever at the OS console for the connection with nfsnode1 (wfc-timeout=0, i.e. infinite). And this is a safe (?) behaviour, since the situation is not a correct one. I stop the wait by typing yes+<ENTER> at the console (I simulate the perfect operator... I have been an operator too in the past, so I can tell... ;-)

The drbd status on nfsnode2 is, and remains:

  0:drbd-resource-0  WFConnection  Secondary/Unknown  Outdated/DUnknown  C

and in the log, 6 times:

  Aug 22 12:57:37 nfsnode2 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
  Aug 22 12:57:37 nfsnode2 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown r--- }
  Aug 22 12:57:37 nfsnode2 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Outdated/DUnknown r--- }
  ...
  Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2683]: ERROR: Return code 1 from /etc/ha.d/resource.d/drbddisk
  Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2684]: CRIT: Giving up resources due to failure of drbddisk::drbd-resource-0
  Aug 22 12:57:43 nfsnode2 ResourceManager[2610]: [2685]: info: Releasing resource group: nfsnode2 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/drbd0::ext3 killnfsd nfslock nfs Delay::3::0 IPaddr::10.4.5.103/24/eth0
  ...
  Aug 22 12:58:13 nfsnode2 hb_standby[3047]: [3053]: Going standby [foreign].
  Aug 22 12:58:14 nfsnode2 heartbeat: [2204]: info: nfsnode2 wants to go standby [foreign]
  Aug 22 12:58:24 nfsnode2 heartbeat: [2204]: WARN: No reply to standby request. Standby request cancelled.
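By the way, if I wanted to avoid that infinite wait at boot in step 5), I suppose the startup section could get a finite wfc-timeout, something like (sketch, untested):

  startup {
    wfc-timeout      120;  ## give up waiting for the peer after 2 minutes
    degr-wfc-timeout 120;  ## 2 minutes.
  }

For now I have left the default (infinite), since the wait at least forces the operator to think twice before confirming.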
Now the startup of nfsnode1, step 6): nfsnode1 correctly executes the sync of the drbd data towards nfsnode2, which was not aligned:

  Aug 22 13:06:11 nfsnode1 kernel: drbd0: Began resync as SyncSource (will sync 304 KB [76 bits set]).
  Aug 22 13:06:11 nfsnode1 kernel: drbd0: Writing meta data super block now.
  Aug 22 13:06:11 nfsnode1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 304 K/sec)
  Aug 22 13:06:11 nfsnode1 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

At the end, both nodes are in state Secondary:

  0:drbd-resource-0  Connected  Secondary/Secondary  UpToDate/UpToDate  C

and so heartbeat doesn't activate the NFS service and its virtual IP. The heartbeat logs give:

  Aug 22 13:06:12 nfsnode1 heartbeat: [2040]: info: Local status now set to: 'up'
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Link nfsnode2:eth0 up.
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Status update for node nfsnode2: status active
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 100 ms (> 50 ms) (GSource: 0x83b8940)
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Link nfsnode2:eth1 up.
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Comm_now_up(): updating status to active
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Local status now set to: 'active'
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Starting child client "/usr/lib/heartbeat/dopd" (498,496)
  Aug 22 13:06:13 nfsnode1 heartbeat: [2188]: info: Starting "/usr/lib/heartbeat/dopd" as uid 498 gid 496 (pid 2188)
  Aug 22 13:06:13 nfsnode1 harc[2186]: [2196]: info: Running /etc/ha.d/rc.d/status status
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: remote resource transition completed.
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: remote resource transition completed.
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Local Resource acquisition completed. (none)
  Aug 22 13:06:13 nfsnode1 heartbeat: [2040]: info: Initial resource acquisition complete (T_RESOURCES(them))

Any hints and suggestions? Thanks in advance

Gianluca
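P.S. In case I am misreading the haresources semantics: my understanding is that on takeover the ResourceManager simply runs the resource scripts left to right with "start" (and right to left with "stop" on release), so my chain above should be roughly equivalent to:

  /etc/ha.d/resource.d/drbddisk drbd-resource-0 start    # ~ drbdadm primary drbd-resource-0
  /etc/ha.d/resource.d/Filesystem /dev/drbd0 /drbd0 ext3 start
  /etc/ha.d/resource.d/killnfsd start                    # my helper script
  /etc/init.d/nfslock start
  /etc/init.d/nfs start
  /etc/ha.d/resource.d/Delay 3 0 start
  /etc/ha.d/resource.d/IPaddr 10.4.5.103/24/eth0 start

i.e. everything hinges on the first drbddisk step succeeding, which is exactly what does not happen while both sides stay Secondary.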