Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

I don't know how to summarize what I've encountered, hence the rather lame subject. I'm running an HA setup of 2 nodes on RHEL5U3, and I have done the following tests:

- configure resources, everything runs on cluster1
- hard reboot of both active and passive nodes (cluster1 and cluster2) at the same time (echo b > /proc/sysrq-trigger)
- after the reboot, checked resources, all went fine, they started on cluster1
- hard rebooted the active node (cluster1)
- the second node detects the failure of cluster1, but gets stuck when it tries to start the drbd resources:

Dec 03 16:39:40 cluster2 crmd: [1699]: info: process_lrm_event: LRM operation ping_gw:0_monitor_0 (call=13, rc=7, cib-update=16, confirmed=true) not running
Dec 03 16:39:41 cluster2 crm_attribute: [2585]: info: Invoked: crm_attribute -N cluster2 -n master-drbd_mysql:1 -l reboot -D
Dec 03 16:39:41 cluster2 crm_attribute: [2584]: info: Invoked: crm_attribute -N cluster2 -n master-drbd_home:1 -l reboot -D
Dec 03 16:39:41 cluster2 crm_attribute: [2583]: info: Invoked: crm_attribute -N cluster2 -n master-drbd_storage:1 -l reboot -D
Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating hash entry for master-drbd_home:1
Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating hash entry for master-drbd_storage:1
Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating hash entry for master-drbd_mysql:1
Dec 03 16:39:41 cluster2 crmd: [1699]: info: process_lrm_event: LRM operation drbd_storage:1_monitor_0 (call=12, rc=7, cib-update=17, confirmed=true) not running
Dec 03 16:39:41 cluster2 crmd: [1699]: info: process_lrm_event: LRM operation drbd_home:1_monitor_0 (call=10, rc=7, cib-update=18, confirmed=true) not running
Dec 03 16:39:41 cluster2 crmd: [1699]: info: process_lrm_event: LRM operation drbd_mysql:1_monitor_0 (call=11, rc=7, cib-update=19, confirmed=true) not running
Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating hash entry for probe_complete
Dec 03 16:39:41 cluster2 attrd: [1697]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec 03 16:39:41 cluster2 attrd: [1697]: info: attrd_perform_update: Sent update 10: probe_complete=true
Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing key=96:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=ping_gw:0_start_0 )
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:ping_gw:0:14: start
Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing key=14:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=drbd_home:1_start_0 )
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:drbd_home:1:15: start
Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing key=42:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=drbd_mysql:1_start_0 )
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:drbd_mysql:1:16: start
Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing key=70:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=drbd_storage:1_start_0 )
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:drbd_storage:1:17: start
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:start:stdout)
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: (drbd_storage:1:start:stdout)
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: (drbd_home:1:start:stdout)
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: (drbd_storage:1:start:stdout)
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:start:stdout)
Dec 03 16:39:41 cluster2 crm_attribute: [2746]: info: Invoked: crm_attribute -N cluster2 -n master-drbd_storage:1 -l reboot -v 5
Dec 03 16:39:41 cluster2 attrd: [1697]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_storage:1 (5)
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:start:stderr) 0: Failure: (124) Device is attached to a disk (use detach first)
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:start:stderr) Command 'drbdsetup 0 disk
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:start:stderr) /dev/xvda5 /dev/xvda5 internal --set-defaults --create-device --fencing=resource-only --on-io-error=detach' terminated with exit code 10

And then it continues with:

Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr) 0: State change failed: (-7) Refusing to be Primary while peer is not outdated
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr) Command '
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr) drbdsetup
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr)
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr) 0
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr)
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr) primary
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr) ' terminated with exit code 11
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stdout)
Dec 03 16:40:50 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr) 0: State change failed: (-7) Refusing to be Primary while peer is not outdated
Dec 03 16:40:50 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr) Command 'drbdsetup 0 primary
Dec 03 16:40:50 cluster2 lrmd: [1696]: info: RA output: (drbd_mysql:1:promote:stderr) ' terminated with exit code 11
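For reference, the drbdsetup call above reflects the fencing policy from my drbd.conf. The relevant section looks roughly like this (the resource name and handler paths are from memory, using what I believe are the stock crm-fence-peer scripts shipped with drbd83, so treat it as an approximation rather than the exact file):

resource mysql {
  disk {
    on-io-error  detach;
    fencing      resource-only;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # per-host sections (device/disk/address) omitted here
}

As far as I understand, with resource-only fencing DRBD refuses to become Primary until the fence-peer handler has managed to outdate the peer, which would explain why the promote keeps failing with exit code 11 for a while.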
Anyways, long story short, several minutes and hundreds of log messages later the cluster settles into a working state.

The RPMs I used are:

# rpm -qa | grep -E '(corosync|drbd|pacemaker|resource-agents|openais)'
corosynclib-1.3.0-1
openaislib-1.1.3-1.6.el5
drbd83-8.3.2-6.el5_3
corosync-1.3.0-1
pacemaker-libs-1.0.10-1.4.el5
openais-1.1.3-1.6.el5
pacemaker-1.0.10-1.4.el5
kmod-drbd83-xen-8.3.2-6.el5_3
resource-agents-1.0.3-2.6.el5

Attached is an hb_report. I've never experienced this kind of behavior before; is this normal?

Regards,
Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania