Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Sep 18, 2008 at 10:29:33AM +0200, Christoph Eßer wrote:
> Ooops, gmail incorrectly labeled this mail as sent to the Linux-HA
> list. This is the right one. Sorry for the disturbance.

and it is much too long...

> I added a complete log for the reboot process on my slave machine.
> Maybe this helps.

you could have looked at those logs yourself, you know?

> 2008/9/18 Christoph Eßer <cehser at gmail.com>:
> > Sorry, I accidentally sent the message again to this list.

just ignore it.

> > 2008/9/18 Christoph Eßer <cehser at gmail.com>:
> >> Hi there,
> >>
> >> I first came up with this question on the Linux-HA mailing list.
> >> As they think it seems to be a DRBD related problem, I repeat it here:
> >>
> >> I am using Heartbeat 2.99, Pacemaker 0.6.6 and DRBD 8.0.13 on Debian Lenny.
> >>
> >> Everything works as expected if I just force the system to fail over by
> >> unplugging the power supply of either of my two nodes. Shutting down the
> >> DRBD slave via "init 0" or "init 6" works fine as well.
> >>
> >> But whenever I restart my DRBD master via "init 6" the whole cluster
> >> crashes and fails to reassign the resources, even after the node has
> >> started up again. Anyway, I found some strange DRBD related messages in
> >> /var/log/syslog on the second node while restarting the master node:

> Sep 18 08:14:22 viktor-01 drbd[3863]: [3900]: DEBUG: r0 status: Secondary/Primary Secondary Primary Connected

fine.

> Sep 18 08:15:19 viktor-01 kernel: [ 274.894318] drbd0: peer( Primary -> Secondary )

ok, peer is going secondary.

> Sep 18 08:15:24 viktor-01 kernel: [ 279.772756] drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )

peer disconnected.

> Sep 18 08:15:24 viktor-01 kernel: [ 279.772756] drbd0: conn( Unconnected -> WFConnection )

we go wait for the peer to come back.

> Sep 18 08:15:25 viktor-01 cib: [2279]: info: apply_xml_diff: Digest mis-match: expected 615bd497ee2d3cfdec1bc8611c9af264, calculated dd22ea8b07a5c626d0c03c021d1c3045

dunno whether this is normal behaviour of heartbeat:

> Sep 18 08:15:25 viktor-01 cib: [2279]: info: cib_process_diff: Diff 0.209.194 -> 0.209.195 not applied to 0.209.194: Failed application of a global update. Requesting full refresh.
> Sep 18 08:15:25 viktor-01 cib: [2279]: info: cib_process_diff: Requesting re-sync from peer: Failed application of a global update. Requesting full refresh.
> Sep 18 08:15:25 viktor-01 cib: [2279]: WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
> Sep 18 08:15:25 viktor-01 drbd[4021]: [4050]: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf cstate r0
> Sep 18 08:15:25 viktor-01 cib: [2279]: WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed, requesting a full refresh
> Sep 18 08:15:25 viktor-01 cib: [2279]: WARN: cib_process_diff: Not applying diff 0.209.195 -> 0.209.196 (sync in progress)
> Sep 18 08:15:25 viktor-01 drbd[4021]: [4057]: DEBUG: r0: Exit code 0
> Sep 18 08:15:25 viktor-01 cib: [2279]: WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
> Sep 18 08:15:25 viktor-01 drbd[4021]: [4058]: DEBUG: r0: Command output: WFConnection
> Sep 18 08:15:25 viktor-01 cib: [2279]: WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed, requesting a full refresh
> Sep 18 08:15:28 viktor-01 crmd: [2283]: notice: crmd_client_status_callback: Status update: Client viktor-02/crmd now has status [offline]
> Sep 18 08:15:28 viktor-01 crmd: [2283]: info: crm_update_peer_proc: viktor-02.crmd is now offline

but finally the peer is offline from heartbeat's perspective as well.

> Sep 18 08:15:28 viktor-01 crmd: [2283]: info: crmd_client_status_callback: Got client status callback - our DC is dead

unfortunately it was the DC, so heartbeat elects a new DC.

> Sep 18 08:15:28 viktor-01 crmd: [2283]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=crmd_client_status_callback ]
> Sep 18 08:15:28 viktor-01 crmd: [2283]: info: update_dc: Unset DC viktor-02

...

> Sep 18 08:15:35 viktor-01 ccm: [2278]: info: Break tie for 2 nodes cluster
> Sep 18 08:15:35 viktor-01 crmd: [2283]: info: ccm_event_detail: NEW MEMBERSHIP: trans=3, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3

...

> Sep 18 08:15:35 viktor-01 crmd: [2283]: info: update_dc: Set DC to viktor-01 (2.1)

there. we are the DC.

> Sep 18 08:15:36 viktor-01 crmd: [2283]: info: do_state_transition: All 1 cluster nodes responded to the join offer.

cool. I agree with myself. we start the resource stack:

> Sep 18 08:15:37 viktor-01 pengine: [4066]: WARN: native_color: Resource drbd:0 cannot run anywhere
> Sep 18 08:15:37 viktor-01 pengine: [4066]: info: master_color: Promoting drbd:1 (Slave viktor-01)

...

> Sep 18 08:15:37 viktor-01 pengine: [4066]: notice: NoRoleChange: Promote drbd:1#011(Slave -> Master viktor-01)
> Sep 18 08:15:37 viktor-01 pengine: [4066]: notice: RecurringOp: Start recurring monitor (29s) for drbd:1 on viktor-01
> Sep 18 08:15:37 viktor-01 pengine: [4066]: notice: StartRsc: viktor-01#011Start extern
> Sep 18 08:15:37 viktor-01 pengine: [4066]: notice: StartRsc: viktor-01#011Start eth3:0

...

> Sep 18 08:15:37 viktor-01 pengine: [4066]: notice: StartRsc: viktor-01#011Start fs_mount
> Sep 18 08:15:37 viktor-01 pengine: [4066]: notice: StartRsc: viktor-01#011Start Exim4
> Sep 18 08:15:37 viktor-01 pengine: [4066]: notice: StartRsc: viktor-01#011Start Squid

...
> Sep 18 08:15:37 viktor-01 crmd: [2283]: info: do_lrm_rsc_op: Performing op=drbd:1_promote_0 key=7:0:0:48972b3a-fc9c-405d-bf6c-1385730bb2b3)
> Sep 18 08:15:38 viktor-01 drbd[4080]: [4117]: DEBUG: r0 status: Secondary/Unknown Secondary Unknown WFConnection
> Sep 18 08:15:38 viktor-01 drbd[4080]: [4118]: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf primary r0
> Sep 18 08:15:38 viktor-01 kernel: [ 293.423767] drbd0: helper command: /sbin/drbdadm outdate-peer

you configured drbd to use the outdate-peer handler. so it does.

> Sep 18 08:15:38 viktor-01 drbd-peer-outdater: [4124]: debug: drbd peer: viktor-02
> Sep 18 08:15:38 viktor-01 drbd-peer-outdater: [4124]: debug: drbd resource: r0
> Sep 18 08:15:38 viktor-01 /usr/lib/heartbeat/dopd: [2284]: debug: Connecting channel
> Sep 18 08:15:38 viktor-01 /usr/lib/heartbeat/dopd: [2284]: debug: Client outdater (0x9864a50) connected
> Sep 18 08:15:38 viktor-01 /usr/lib/heartbeat/dopd: [2284]: debug: invoked: outdater
> Sep 18 08:15:38 viktor-01 /usr/lib/heartbeat/dopd: [2284]: debug: Processing msg from outdater
> Sep 18 08:15:38 viktor-01 /usr/lib/heartbeat/dopd: [2284]: debug: Got message from (drbd-peer-outdater). (peer: viktor-02, res :r0)
> Sep 18 08:15:38 viktor-01 /usr/lib/heartbeat/dopd: [2284]: debug: Starting node walk
> Sep 18 08:15:39 viktor-01 /usr/lib/heartbeat/dopd: [2284]: debug: node viktor-02 found
> Sep 18 08:15:39 viktor-01 /usr/lib/heartbeat/dopd: [2284]: info: sending start_outdate message to the other node viktor-01 -> viktor-02
> Sep 18 08:15:39 viktor-01 /usr/lib/heartbeat/dopd: [2284]: debug: sending [start_outdate res: r0] to node: viktor-02
> Sep 18 08:15:39 viktor-01 /usr/lib/heartbeat/dopd: [2284]: debug: Processed 1 messages

meanwhile the peer rebooted?

> Sep 18 08:15:41 viktor-01 kernel: [ 297.000380] e1000: eth2: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> Sep 18 08:15:44 viktor-01 kernel: [ 300.008048] drbd0: connect failed, err = -115
> Sep 18 08:15:54 viktor-01 kernel: [ 310.108051] drbd0: connect failed, err = -115

but has no drbd port open yet.

> Sep 18 08:15:57 viktor-01 lrmd: [2280]: WARN: drbd:1:promote process (PID 4080) timed out (try 1). Killing with signal SIGTERM (15).

the timeout on the promote is too short, compared to the timeout of
drbd-peer-outdater. add "-t 5" to the drbd-peer-outdater command in
drbd.conf, or increase the promote timeouts (see the sketches below).

> Sep 18 08:15:57 viktor-01 lrmd: [2280]: WARN: operation promote[54] on ocf::drbd::drbd:1 for client 2283, its parameters: clone_max=[2] CRM_meta_role=[Master] drbdconf=[/etc/drbd.conf] CRM_meta_notify_start_resource=[drbd:0 drbd:1 ] CRM_meta_notify_promote_resource=[drbd:1 ] drbd_resource=[r0] CRM_meta_master_max=[1] CRM_meta_stateful=[true] CRM_meta_timeout=[20000] crm_feature_set=[2.1] CRM_meta_clone_max=[2] CRM_meta_master_node_max=[1] master_node_max=[1] master_max=[1] CRM_meta_notify_promote_uname=[viktor-01 ] CRM_meta_notify_start_uname=[viktor-: pid [4080] timed out
> Sep 18 08:15:57 viktor-01 crmd: [2283]: ERROR: process_lrm_event: LRM operation drbd:1_promote_0 (54) Timed Out (timeout=20000ms)

the default timeout for drbd-peer-outdater is 60 seconds. the promote
timeout is set to 20 seconds. too bad.

the "recovery action" of heartbeat for a failed promote is probably
"stop", then "start", while drbd itself is still trying to become
primary once that drbd-peer-outdater helper returns.
hm, that's going to hurt.

... blabla ...
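for reference, a dopd/outdate-peer setup in drbd.conf usually looks
roughly like the sketch below. the resource name and the helper path
are taken from your logs; everything else is only an illustration and
needs to be adapted to your actual config (the path to
drbd-peer-outdater may differ on your system):

  resource r0 {
    disk {
      # without a fencing policy, drbd would not call the handler at all
      fencing resource-only;
    }
    handlers {
      # "-t 5": give up after 5 seconds instead of the 60 second default,
      # so the handler returns well within a 20 second promote timeout
      outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }
    ...
  }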
> Sep 18 08:15:58 viktor-01 pengine: [4066]: info: master_color: Promoting drbd:1 (Master viktor-01)
> Sep 18 08:15:58 viktor-01 pengine: [4066]: info: master_color: ms_drbd_ip: Promoted 1 instances of a possible 1 to master
> Sep 18 08:15:58 viktor-01 pengine: [4066]: info: master_color: ms_drbd_ip: Promoted 1 instances of a possible 1 to master
> Sep 18 08:15:58 viktor-01 pengine: [4066]: notice: NoRoleChange: Leave resource drbd:1#011(Master viktor-01)

heartbeat seems to be confused. wants to promote the master? but it is
already master? ok, so it tries to demote first:

> Sep 18 08:15:58 viktor-01 lrmd: [2280]: info: rsc:drbd:1: demote
> Sep 18 08:15:58 viktor-01 crmd: [2283]: info: do_lrm_rsc_op: Performing op=drbd:1_demote_0 key=6:1:0:48972b3a-fc9c-405d-bf6c-1385730bb2b3)
> Sep 18 08:15:58 viktor-01 drbd[4137]: [4148]: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf secondary r0
> Sep 18 08:15:59 viktor-01 drbd[4137]: [4155]: DEBUG: r0: Exit code 0
> Sep 18 08:15:59 viktor-01 drbd[4137]: [4156]: DEBUG: r0: Command output: No response from the DRBD driver! Is the module loaded? Command 'drbdsetup /dev/drbd0 secondary' terminated with exit code 20 drbdadm aborting

huh? did not work? well, no, because the promotion is still pending...
so what is it? exit code 20, or exit code 0?

> Sep 18 08:15:59 viktor-01 lrmd: [2280]: info: RA output: (drbd:1:demote:stdout) No response from the DRBD driver! Is the module loaded? Command 'drbdsetup /dev/drbd0 secondary' terminated with exit code 20 drbdadm aborting

sure it does not answer role change requests, as there is a role change
request still pending.

> Sep 18 08:16:00 viktor-01 heartbeat: [2173]: WARN: node viktor-02: is dead
> Sep 18 08:16:00 viktor-01 crmd: [2283]: notice: crmd_ha_status_callback: Status update: Node viktor-02 now has status [dead]
> Sep 18 08:16:00 viktor-01 heartbeat: [2173]: info: Link viktor-02:eth0 dead.
> Sep 18 08:16:00 viktor-01 crmd: [2283]: info: crm_update_peer_proc: viktor-02.ais is now offline
> Sep 18 08:16:00 viktor-01 heartbeat: [2173]: info: Link viktor-02:eth1 dead.

uh? I thought we knew that already? no, we only knew it was offline and
the former DC was dead. whatever, nit picking on wording.

...

> Sep 18 08:16:01 viktor-01 pengine: [4066]: info: master_color: Promoting drbd:1 (Slave viktor-01)

and more "recovery actions"

> Sep 18 08:16:01 viktor-01 pengine: [4066]: info: master_color: ms_drbd_ip: Promoted 1 instances of a possible 1 to master
> Sep 18 08:16:01 viktor-01 pengine: [4066]: info: master_color: ms_drbd_ip: Promoted 1 instances of a possible 1 to master
> Sep 18 08:16:01 viktor-01 pengine: [4066]: notice: NoRoleChange: Promote drbd:1#011(Slave -> Master viktor-01)
> Sep 18 08:16:01 viktor-01 pengine: [4066]: notice: StopRsc: viktor-01#011Stop drbd:1
> Sep 18 08:16:01 viktor-01 pengine: [4066]: notice: StartRsc: viktor-01#011Start drbd:1

yes, now it wants to _stop_ (which in the heartbeat ocf drbd ra means
even unload the drbd module) and then start again... stupid timeouts.

...
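the other knob is the promote (and demote) timeout on the master/slave
resource in the CIB: it needs to be larger than whatever
drbd-peer-outdater may take. I have not checked this against your
actual CIB, so treat it as a rough sketch only (ids and the 90s value
are made up; adjust them to your existing ms_drbd_ip definition):

  <master_slave id="ms_drbd_ip">
    <primitive id="drbd" class="ocf" provider="heartbeat" type="drbd">
      <operations>
        <!-- must be longer than the worst case drbd-peer-outdater runtime -->
        <op id="drbd-op-promote" name="promote" timeout="90s"/>
        <op id="drbd-op-demote" name="demote" timeout="90s"/>
      </operations>
      ...
    </primitive>
  </master_slave>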
> Sep 18 08:16:02 viktor-01 lrmd: [2280]: info: rsc:drbd:1: stop
> Sep 18 08:16:02 viktor-01 crmd: [2283]: info: do_lrm_rsc_op: Performing op=drbd:1_stop_0 key=1:2:0:48972b3a-fc9c-405d-bf6c-1385730bb2b3)
> Sep 18 08:16:02 viktor-01 drbd[4214]: [4226]: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf state r0
> Sep 18 08:16:02 viktor-01 drbd[4214]: [4242]: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf cstate r0
> Sep 18 08:16:03 viktor-01 drbd[4214]: [4249]: DEBUG: r0: Exit code 0
> Sep 18 08:16:03 viktor-01 drbd[4214]: [4250]: DEBUG: r0: Command output: No response from the DRBD driver! Is the module loaded? Command 'drbdsetup /dev/drbd0 cstate' terminated with exit code 20 drbdadm aborting
> Sep 18 08:16:03 viktor-01 lrmd: [2280]: info: RA output: (drbd:1:stop:stdout)
> Sep 18 08:16:03 viktor-01 drbd[4214]: [4257]: DEBUG: r0 stop: already unconfigured.

and the ra thinks that drbd was already stopped. not good, as it is not.
stupid stupid timeouts.

now, the start:

> Sep 18 08:16:04 viktor-01 pengine: [4066]: notice: StartRsc: viktor-01#011Start drbd:0

... and so on.

> Sep 18 08:16:38 viktor-01 drbd-peer-outdater: [4124]: WARN: error: could not connect to dopd after 60 seconds: timeout reached

there. finally. although that message is misleading: drbd-peer-outdater
was able to connect to the local dopd, but the local dopd was not able
to connect to the remote dopd, and when it was invoked, apparently it
was not yet able to determine that there was no point in trying to
contact the remote dopd, as it was offline (but not dead). or some such.
anyways, the full timeout expired.

> Sep 18 08:16:57 viktor-01 kernel: [ 373.064043] drbd0: Handshake successful: DRBD Network Protocol version 86

but by now the peer has rebooted, and drbd re-establishes the connection.

> Sep 18 08:16:57 viktor-01 kernel: [ 373.064133] drbd0: conn( WFConnection -> WFReportParams )
> Sep 18 08:16:57 viktor-01 kernel: [ 373.076877] drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> Sep 18 08:16:57 viktor-01 kernel: [ 373.076934] drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> Sep 18 08:16:58 viktor-01 heartbeat: [2173]: info: Heartbeat restart on node viktor-02

you apparently really have drbd started from init _and_ use the
heartbeat ocf drbd ra. not good, as heartbeat (with that drbd ocf ra)
insists on doing it itself, and gets confused if it was done already
(see the note after this excerpt on disabling the init script).

> Sep 18 08:17:04 viktor-01 pengine: [4066]: info: determine_online_status: Node viktor-01 is online
> Sep 18 08:17:04 viktor-01 pengine: [4066]: info: unpack_rsc_op: Remapping drbd:0_monitor_0 (rc=0) on viktor-01 to an ERROR (expected 7)

right. heartbeat had its own picture of how drbd's state "should" be,
but it got it wrong. reality looks different from what heartbeat thinks
it should look like.

...

and we need another membership update

> Sep 18 08:17:10 viktor-01 tengine: [4065]: info: run_graph: Throttling output: batch limit (30) reached

uh? is that normal, or is the heartbeat graph processing looping?

anyways, heartbeat tries to make reality look "correct" again, by
trying to stop drbd.
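as for drbd being started from init: on Debian you would typically just
take the init script out of the runlevels, so that only heartbeat (via
the ocf ra) brings drbd up and down. something along these lines (from
memory, so double check before running it on your nodes):

  # keep the init script installed, but stop it from running at boot
  update-rc.d -f drbd remove

anyway, back to the log: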
> Sep 18 08:17:23 viktor-01 tengine: [4065]: info: send_rsc_command: Initiating action 15: stop drbd:1_stop_0 on viktor-02
> Sep 18 08:17:25 viktor-01 kernel: [ 400.847494] drbd0: State change failed: Refusing to be Primary while peer is not outdated
> Sep 18 08:17:25 viktor-01 kernel: [ 400.847494] drbd0: state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- }
> Sep 18 08:17:25 viktor-01 kernel: [ 400.847494] drbd0: wanted = { cs:TearDown st:Primary/Unknown ds:UpToDate/DUnknown r--- }
> Sep 18 08:17:25 viktor-01 kernel: [ 400.849214] drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )

ok, that did work (it needed two tries, as the first try could not yet
confirm the "out-of-date-ness" of the peer, but it did work). so,
viktor-01 is still drbd Primary, while viktor-02 apparently has now
been unconfigured.

> Sep 18 08:17:29 viktor-01 pengine: [4066]: notice: clone_print: Master/Slave Set: ms_drbd_ip
> Sep 18 08:17:29 viktor-01 pengine: [4066]: notice: native_print: drbd:0#011(ocf::heartbeat:drbd):#011Started viktor-02
> Sep 18 08:17:29 viktor-01 pengine: [4066]: notice: native_print: drbd:1#011(ocf::heartbeat:drbd):#011Stopped

hm. that still does not match reality.

> Sep 18 08:17:29 viktor-01 pengine: [4066]: WARN: native_color: Resource drbd:1 cannot run anywhere
> Sep 18 08:17:29 viktor-01 pengine: [4066]: info: master_color: Promoting drbd:0 (Slave viktor-02)

nonsense. drbd:1 _IS_ still running (and Primary) on viktor-01.
viktor-02 has just been unconfigured. at least the latter should be
known by heartbeat, as it initiated (and successfully completed) it.

> Sep 18 08:17:29 viktor-01 pengine: [4066]: notice: NoRoleChange: Promote drbd:0#011(Slave -> Master viktor-02)

that won't work.

> Sep 18 08:17:30 viktor-01 tengine: [4065]: info: send_rsc_command: Initiating action 8: promote drbd:0_promote_0 on viktor-02
> Sep 18 08:17:32 viktor-01 crmd: [2283]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
> Sep 18 08:17:32 viktor-01 tengine: [4065]: info: status_from_rc: Re-mapping op status to LRM_OP_ERROR for rc=1
> Sep 18 08:17:32 viktor-01 crmd: [2283]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> Sep 18 08:17:32 viktor-01 tengine: [4065]: WARN: status_from_rc: Action 8 (promote) on viktor-02 failed (target: 0 vs. rc: 1): Error

right. you cannot promote an unconfigured resource.

> Sep 18 08:17:32 viktor-01 pengine: [4066]: notice: native_print: drbd:0#011(ocf::heartbeat:drbd):#011Master viktor-02 FAILED

OF COURSE. this is strange: why would heartbeat try to unconfigure and
then promote, without a start in between?

> Sep 18 08:17:32 viktor-01 pengine: [4066]: notice: native_print: drbd:1#011(ocf::heartbeat:drbd):#011Stopped

this still runs on viktor-01, as Primary!

> Sep 18 08:17:32 viktor-01 pengine: [4066]: WARN: native_color: Resource drbd:1 cannot run anywhere

nonsense. it does.

> Sep 18 08:17:32 viktor-01 pengine: [4066]: info: master_color: Promoting drbd:0 (Master viktor-02)

why promote it? it should first be started! ah.
right:

> Sep 18 08:17:34 viktor-01 tengine: [4065]: info: send_rsc_command: Initiating action 7: demote drbd:0_demote_0 on viktor-02
> Sep 18 08:17:40 viktor-01 tengine: [4065]: info: send_rsc_command: Initiating action 1: stop drbd:0_stop_0 on viktor-02
> Sep 18 08:17:42 viktor-01 tengine: [4065]: info: send_rsc_command: Initiating action 5: start drbd:0_start_0 on viktor-02

there we are again.

> Sep 18 08:17:43 viktor-01 kernel: [ 418.532147] drbd0: Handshake successful: DRBD Network Protocol version 86
> Sep 18 08:17:43 viktor-01 kernel: [ 418.534929] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
> Sep 18 08:17:43 viktor-01 kernel: [ 418.549743] drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
> Sep 18 08:17:43 viktor-01 kernel: [ 418.549961] drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> Sep 18 08:17:45 viktor-01 tengine: [4065]: info: match_graph_event: Action drbd:0_start_0 (5) confirmed on viktor-02 (rc=0)
> Sep 18 08:17:48 viktor-01 pengine: [4066]: WARN: unpack_rsc_op: drbd:0_monitor_0 found active drbd:0 on viktor-02

uh?

> Sep 18 08:17:48 viktor-01 pengine: [4066]: WARN: unpack_rsc_op: drbd:1_monitor_0 found active drbd:1 on viktor-02

how can it find both clones active on the same node? maybe I just still
don't get the concept of the clone/master/slave resources? I thought I
did, though. very strange.

and so on.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.