[DRBD-user] NFS Connections Dropped, But NFS still running

Wed May 2 16:22:47 CEST 2012

Hello, I'm looking for advice on how to troubleshoot an issue I experienced on one of my DRBD clusters. I'm running:
- Ubuntu 11.10 (GNU/Linux 3.0.0-12-server x86_64)
- DRBD 8.3.11
- Pacemaker 1.1.5 / Heartbeat 3.0.5

Primary Node: filer2a
Secondary Node: filer2b

I'm running NFS which shares house virtual machines.  I noticed VMs were unresponsive, so looking on the primary node I ran:

- 'netstat | grep nfs' & 'showmount -a' and saw that none of my KVM hosts were connected.
- 'ps aux | grep nfs' and saw NFS was still running.
- 'drbd-overview' and saw ' 0:r0  Connected Primary/Secondary UpToDate/UpToDate C r----- /mnt/drbd0 ext4 11T 1001G 9.1T 10%'

I went to one KVM host and could not touch a file on my NFS share. SO I restarted NFS and within seconds saw the KVM hosts reconnect to the share. And to my surprise, I was able to login the, once hung, VMs without any issues, guess NFS cache did its job?!

I've looked in my syslog file and I've included below the timeframe of logs I believe is when an issue occurred, 12:58p...

Can anyone help me decipher what the issue may have been or to the more experienced DRBD admins, is there something completely obvious here. Are there other logs I should be looking at?

Thank you!
Basher

Syslog:

Apr 30 12:58:20 Filer2A lrmd: [1684]: WARN: res_IPaddr2_1:monitor process (PID 18454) timed out (try 1).  Killing with signal SIGTERM (15).
Apr 30 12:58:20 Filer2A lrmd: [1684]: WARN: operation monitor[11] on ocf::IPaddr2::res_IPaddr2_1 for client 1687, its parameters: CRM_meta_start_delay=[0] crm_feature_set=[3.0.5] CRM_meta_name=[monitor] CRM_meta_interval=[10000] CRM_meta_timeout=[20000] ip=[10.254.253.205] : pid [18454] timed out
Apr 30 12:58:20 Filer2A crmd: [1687]: ERROR: process_lrm_event: LRM operation res_IPaddr2_1_monitor_10000 (11) Timed Out (timeout=20000ms)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: process_graph_event: Action res_IPaddr2_1_monitor_10000/44 (2:-2;44:8:0:198b5a2e-56cd-4d8a-9c0a-694ad16b8e82) initiated by a different transitioner
Apr 30 12:58:21 Filer2A crmd: [1687]: info: abort_transition_graph: process_graph_event:467 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=res_IPaddr2_1_monitor_10000, magic=2:-2;44:8:0:198b5a2e-56cd-4d8a-9c0a-694ad16b8e82, cib=0.18.529) : Foreign event
Apr 30 12:58:21 Filer2A crmd: [1687]: WARN: update_failcount: Updating failcount for res_IPaddr2_1 on filer2a after failed monitor: rc=-2 (update=value++, time=1335808701)
Apr 30 12:58:21 Filer2A attrd: [1686]: info: find_hash_entry: Creating hash entry for fail-count-res_IPaddr2_1
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_local_callback: Expanded fail-count-res_IPaddr2_1=value++ to 1
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res_IPaddr2_1 (1)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_pe_invoke: Query 4764: Requesting the current CIB: S_POLICY_ENGINE
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_perform_update: Sent update 127: fail-count-res_IPaddr2_1=1
Apr 30 12:58:21 Filer2A attrd: [1686]: info: find_hash_entry: Creating hash entry for last-failure-res_IPaddr2_1
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res_IPaddr2_1 (1335808701)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_pe_invoke_callback: Invoking the PE: query=4764, ref=pe_calc-dc-1335808701-4715, seq=14, quorate=1
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_rsc_op: Operation res_drbd_1:1_monitor_0 found resource res_drbd_1:1 active on filer2b
Apr 30 12:58:21 Filer2A pengine: [5000]: WARN: unpack_rsc_op: Processing failed op res_IPaddr2_1_monitor_10000 on filer2a: unknown exec error (-2)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_rsc_op: Operation res_drbd_1:0_monitor_0 found resource res_drbd_1:0 active on filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: clone_print:  Master/Slave Set: ms_drbd_1 [res_drbd_1]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: short_print:      Masters: [ filer2a ]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: short_print:      Slaves: [ filer2b ]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: res_Filesystem_1#011(ocf::heartbeat:Filesystem):#011Started filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: res_IPaddr2_1#011(ocf::heartbeat:IPaddr2):#011Started filer2a FAILED
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: res_nfs-kernel-server_1#011(lsb:nfs-kernel-server):#011Started filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: common_apply_stickiness: res_nfs-kernel-server_1 can fail 999997 more times on filer2a before being forced off
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: RecurringOp:  Start recurring monitor (10s) for res_IPaddr2_1 on filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   res_drbd_1:0#011(Master filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   res_drbd_1:1#011(Slave filer2b)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   res_Filesystem_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Recover res_IPaddr2_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   res_nfs-kernel-server_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Apr 30 12:58:21 Filer2A crmd: [1687]: info: unpack_graph: Unpacked transition 4619: 4 actions in 4 synapses
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_te_invoke: Processing graph 4619 (ref=pe_calc-dc-1335808701-4715) derived from /var/lib/pengine/pe-input-4664.bz2
Apr 30 12:58:21 Filer2A crmd: [1687]: info: te_rsc_command: Initiating action 5: stop res_IPaddr2_1_stop_0 on filer2a (local)
Apr 30 12:58:21 Filer2A lrmd: [1684]: info: cancel_op: operation monitor[11] on ocf::IPaddr2::res_IPaddr2_1 for client 1687, its parameters: CRM_meta_start_delay=[0] crm_feature_set=[3.0.5] CRM_meta_name=[monitor] CRM_meta_interval=[10000] CRM_meta_timeout=[20000] ip=[10.254.253.205]  cancelled
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_lrm_rsc_op: Performing key=5:4619:0:600e8087-898a-424d-9199-a8dc2b8898c0 op=res_IPaddr2_1_stop_0 )
Apr 30 12:58:21 Filer2A lrmd: [1684]: info: rsc:res_IPaddr2_1:37: stop
Apr 30 12:58:21 Filer2A crmd: [1687]: info: process_lrm_event: LRM operation res_IPaddr2_1_monitor_10000 (call=11, status=1, cib-update=0, confirmed=true) Cancelled
Apr 30 12:58:21 Filer2A crmd: [1687]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-1c5ace0e-6f0e-4b34-817c-b4ea48deef8e-fail-count-res_IPaddr2_1, magic=NA, cib=0.18.530) : Transient attribute: update
Apr 30 12:58:21 Filer2A crmd: [1687]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Apr 30 12:58:21 Filer2A crmd: [1687]: info: update_abort_priority: Abort action done superceeded by restart
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_perform_update: Sent update 130: last-failure-res_IPaddr2_1=1335808701
Apr 30 12:58:21 Filer2A crmd: [1687]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-1c5ace0e-6f0e-4b34-817c-b4ea48deef8e-last-failure-res_IPaddr2_1, magic=NA, cib=0.18.531) : Transient attribute: update
Apr 30 12:58:21 Filer2A IPaddr2[18526]: INFO: IP status = ok, IP_CIP=
Apr 30 12:58:21 Filer2A crmd: [1687]: info: process_lrm_event: LRM operation res_IPaddr2_1_stop_0 (call=37, rc=0, cib-update=4765, confirmed=true) ok
Apr 30 12:58:21 Filer2A crmd: [1687]: info: match_graph_event: Action res_IPaddr2_1_stop_0 (5) confirmed on filer2a (rc=0)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: run_graph: ====================================================
Apr 30 12:58:21 Filer2A crmd: [1687]: notice: run_graph: Transition 4619 (Complete=1, Pending=0, Fired=0, Skipped=3, Incomplete=0, Source=/var/lib/pengine/pe-input-4664.bz2): Stopped
Apr 30 12:58:21 Filer2A crmd: [1687]: info: te_graph_trigger: Transition 4619 is now complete
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_pe_invoke: Query 4766: Requesting the current CIB: S_POLICY_ENGINE
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_pe_invoke_callback: Invoking the PE: query=4766, ref=pe_calc-dc-1335808701-4717, seq=14, quorate=1
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_rsc_op: Operation res_drbd_1:1_monitor_0 found resource res_drbd_1:1 active on filer2b
Apr 30 12:58:21 Filer2A pengine: [5000]: WARN: unpack_rsc_op: Processing failed op res_IPaddr2_1_monitor_10000 on filer2a: unknown exec error (-2)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_rsc_op: Operation res_drbd_1:0_monitor_0 found resource res_drbd_1:0 active on filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: clone_print:  Master/Slave Set: ms_drbd_1 [res_drbd_1]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: short_print:      Masters: [ filer2a ]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: short_print:      Slaves: [ filer2b ]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: res_Filesystem_1#011(ocf::heartbeat:Filesystem):#011Started filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: res_IPaddr2_1#011(ocf::heartbeat:IPaddr2):#011Stopped
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: res_nfs-kernel-server_1#011(lsb:nfs-kernel-server):#011Started filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: common_apply_stickiness: res_IPaddr2_1 can fail 999999 more times on filer2a before being forced off
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: common_apply_stickiness: res_nfs-kernel-server_1 can fail 999997 more times on filer2a before being forced off
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: RecurringOp:  Start recurring monitor (10s) for res_IPaddr2_1 on filer2b
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   res_drbd_1:0#011(Master filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   res_drbd_1:1#011(Slave filer2b)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   res_Filesystem_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Start   res_IPaddr2_1#011(filer2b)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   res_nfs-kernel-server_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Apr 30 12:58:21 Filer2A crmd: [1687]: info: unpack_graph: Unpacked transition 4620: 2 actions in 2 synapses
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_te_invoke: Processing graph 4620 (ref=pe_calc-dc-1335808701-4717) derived from /var/lib/pengine/pe-input-4665.bz2
Apr 30 12:58:21 Filer2A crmd: [1687]: info: te_rsc_command: Initiating action 41: start res_IPaddr2_1_start_0 on filer2b
Apr 30 12:58:22 Filer2A crmd: [1687]: info: match_graph_event: Action res_IPaddr2_1_start_0 (41) confirmed on filer2b (rc=0)
Apr 30 12:58:22 Filer2A crmd: [1687]: info: te_rsc_command: Initiating action 42: monitor res_IPaddr2_1_monitor_10000 on filer2b
Apr 30 12:58:24 Filer2A crmd: [1687]: info: match_graph_event: Action res_IPaddr2_1_monitor_10000 (42) confirmed on filer2b (rc=0)
Apr 30 12:58:24 Filer2A crmd: [1687]: info: run_graph: ====================================================
Apr 30 12:58:24 Filer2A crmd: [1687]: notice: run_graph: Transition 4620 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-4665.bz2): Complete
Apr 30 12:58:24 Filer2A crmd: [1687]: info: te_graph_trigger: Transition 4620 is now complete
Apr 30 12:58:24 Filer2A crmd: [1687]: info: notify_crmd: Transition 4620 status: done - <null>
Apr 30 12:58:24 Filer2A crmd: [1687]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 30 12:58:24 Filer2A crmd: [1687]: info: do_state_transition: Starting PEngine Recheck Timer

NOTE:  This email, including any attached files, is confidential, may be legally privileged, and is solely for the intended recipient(s).  If you received this email in error, please destroy it and notify us immediately by reply email or phone (636-519-7070).  Any unauthorized use, dissemination, disclosure, copying or printing is strictly prohibited.