[DRBD-user] Large delay when restarting active node

Dan Frincu dfrincu at streamwide.ro
Fri Dec 3 16:15:02 CET 2010

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

I don't know how to summarize what I've encountered, hence the rather 
lame subject line. I'm running an HA setup of 2 nodes on RHEL5U3, and I 
ran the following test:
- configure resources; everything runs on cluster1
- hard reboot of both the active and the passive node (cluster1 and 
cluster2) at the same time (echo b > /proc/sysrq-trigger)
- after the reboot, check the resources; all went fine, they started on 
cluster1
- hard reboot of the active node (cluster1)
- the second node detects the failure of cluster1, but gets stuck when 
it tries to start the drbd resources (a rough sketch of the test 
sequence follows this list)
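
For reference, the failover test itself is roughly the following; a 
minimal sketch, assuming root SSH access to both nodes (crm_mon ships 
with pacemaker 1.0.x):

#!/bin/bash
# Hard-crash the active node: sysrq 'b' reboots immediately, with no
# clean shutdown, so the peer sees a genuine node failure.
ssh root@cluster1 'echo b > /proc/sysrq-trigger' &

# Poll the survivor until the resources come up; crm_mon -1 prints a
# single status snapshot. Ctrl-C once everything reports Started/Master.
while true; do
    ssh root@cluster2 'crm_mon -1' | grep -E 'Started|Master'
    sleep 10
done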

Dec 03 16:39:40 cluster2 crmd: [1699]: info: process_lrm_event: LRM 
operation ping_gw:0_monitor_0 (call=13, rc=7, cib-update=16, 
confirmed=true) not running
Dec 03 16:39:41 cluster2 crm_attribute: [2585]: info: Invoked: 
crm_attribute -N cluster2 -n master-drbd_mysql:1 -l reboot -D
Dec 03 16:39:41 cluster2 crm_attribute: [2584]: info: Invoked: 
crm_attribute -N cluster2 -n master-drbd_home:1 -l reboot -D
Dec 03 16:39:41 cluster2 crm_attribute: [2583]: info: Invoked: 
crm_attribute -N cluster2 -n master-drbd_storage:1 -l reboot -D
Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating 
hash entry for master-drbd_home:1
Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating 
hash entry for master-drbd_storage:1
Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating 
hash entry for master-drbd_mysql:1
Dec 03 16:39:41 cluster2 crmd: [1699]: info: process_lrm_event: LRM 
operation drbd_storage:1_monitor_0 (call=12, rc=7, cib-update=17, 
confirmed=true) not running
Dec 03 16:39:41 cluster2 crmd: [1699]: info: process_lrm_event: LRM 
operation drbd_home:1_monitor_0 (call=10, rc=7, cib-update=18, 
confirmed=true) not running
Dec 03 16:39:41 cluster2 crmd: [1699]: info: process_lrm_event: LRM 
operation drbd_mysql:1_monitor_0 (call=11, rc=7, cib-update=19, 
confirmed=true) not running
Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating 
hash entry for probe_complete
Dec 03 16:39:41 cluster2 attrd: [1697]: info: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete (true)
Dec 03 16:39:41 cluster2 attrd: [1697]: info: attrd_perform_update: Sent 
update 10: probe_complete=true
Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing 
key=96:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=ping_gw:0_start_0 )
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:ping_gw:0:14: start
Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing 
key=14:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=drbd_home:1_start_0 )
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:drbd_home:1:15: start
Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing 
key=42:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=drbd_mysql:1_start_0 )
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:drbd_mysql:1:16: start
Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing 
key=70:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=drbd_storage:1_start_0 )
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:drbd_storage:1:17: start
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:start:stdout)

Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: 
(drbd_storage:1:start:stdout)

Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: 
(drbd_home:1:start:stdout)

Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: 
(drbd_storage:1:start:stdout)

Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:start:stdout)

Dec 03 16:39:41 cluster2 crm_attribute: [2746]: info: Invoked: 
crm_attribute -N cluster2 -n master-drbd_storage:1 -l reboot -v 5
Dec 03 16:39:41 cluster2 attrd: [1697]: info: attrd_trigger_update: 
Sending flush op to all hosts for: master-drbd_storage:1 (5)
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:start:stderr) 0: Failure: (124) Device is attached to a 
disk (use detach first)

Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:start:stderr) Command 'drbdsetup 0 disk
Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:start:stderr) /dev/xvda5 /dev/xvda5 internal 
--set-defaults --create-device --fencing=resource-only 
--on-io-error=detach' terminated with exit code 10
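
If I read the "(124) Device is attached to a disk" failure right, the 
start op is trying to attach a backing device that is already attached, 
as if something (the drbd init script at boot?) brought the resource up 
before Pacemaker did. A few checks I would run; the resource name 
"mysql" is my guess, inferred from the drbd_mysql agent instance:

cat /proc/drbd        # kernel module view of all devices and states
drbdadm state mysql   # local role (Primary/Secondary)
drbdadm dstate mysql  # disk state (UpToDate/Outdated/Diskless/...)
# When Pacemaker manages DRBD, the LSB init script is normally
# disabled so the two don't race each other at boot:
chkconfig drbd off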

It then continues with:

Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr) 0: State change failed: (-7) Refusing to 
be Primary while peer is not outdated

Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr) Command '
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr) drbdsetup
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr)
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr) 0
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr)
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr) primary
Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr) ' terminated with exit code 11

Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stdout)

Dec 03 16:40:50 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr) 0: State change failed: (-7) Refusing to 
be Primary while peer is not outdated

Dec 03 16:40:50 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr) Command 'drbdsetup 0 primary
Dec 03 16:40:50 cluster2 lrmd: [1696]: info: RA output: 
(drbd_mysql:1:promote:stderr) ' terminated with exit code 11
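
My reading of the repeated promote failures: with fencing set to 
resource-only (which the drbdsetup command line above shows is the 
case), DRBD refuses to go Primary until the dead peer has been marked 
Outdated, and it relies on a fence-peer handler to do that. A sketch of 
the drbd.conf pieces involved, assuming the crm-fence-peer.sh script 
that I believe ships with drbd83 (the path and the "mysql" resource 
name are assumptions on my part):

resource mysql {
  disk {
    fencing resource-only;   # matches --fencing=resource-only above
  }
  handlers {
    # Without a fence-peer handler nothing outdates the dead peer,
    # and promote keeps being retried and refused in the meantime.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}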

Anyway, long story short: several minutes and hundreds of log messages 
later, the cluster settles into a working state.

The RPMs I used are:
# rpm -qa | grep -E '(corosync|drbd|pacemaker|resource-agents|openais)'
corosynclib-1.3.0-1
openaislib-1.1.3-1.6.el5
drbd83-8.3.2-6.el5_3
corosync-1.3.0-1
pacemaker-libs-1.0.10-1.4.el5
openais-1.1.3-1.6.el5
pacemaker-1.0.10-1.4.el5
kmod-drbd83-xen-8.3.2-6.el5_3
resource-agents-1.0.3-2.6.el5

Attached is an hb_report.

I've never experienced this kind of behavior before. Is this normal?

Regards,
Dan

-- 
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb_report.tar.bz2
Type: application/octet-stream
Size: 211078 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20101203/ce6d9452/attachment.obj>

