[drbd-mc] DRBD UI device issues ...

Gareth Bult gareth at bult.co.uk
Fri Nov 6 22:43:54 CET 2009


Ok,

So I've got it going; I now have the GUI controlling /dev/md_d0 <-> /dev/md_d0 as /dev/drbd0.
Looks good.
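
(For reference, the GUI-generated drbd.conf presumably ends up as something along these lines; the resource name and addresses here are just placeholders, the GUI writes its own;)

      resource r0 {
              on nas1 {
                      device    /dev/drbd0;
                      disk      /dev/md_d0;
                      address   192.168.0.1:7788;   # placeholder address
                      meta-disk internal;
              }
              on nas2 {
                      device    /dev/drbd0;
                      disk      /dev/md_d0;
                      address   192.168.0.2:7788;   # placeholder address
                      meta-disk internal;
              }
      }
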
I made the suggested change, but also added;

      || $major == 254

to the major-number check in;

      get_disk_info()
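
(To illustrate what I mean by the major test: this is not the helper code itself, just a throwaway check I can run on the boxes to see what major md_d0 actually gets here;)

      #!/usr/bin/perl
      # Print a block device's major/minor and report whether the major
      # is one we'd accept (9 = the classic md major; 254 = what the
      # partitionable md_d devices happen to get on my boxes).
      use strict;
      use warnings;

      my $dev   = shift || "/dev/md_d0";
      my $rdev  = (stat($dev))[6] or die "stat $dev: $!\n";
      my $major = $rdev >> 8;        # fine while majors and minors are < 256
      my $minor = $rdev & 0xff;

      if ($major == 9 || $major == 254) {
              print "$dev: major $major minor $minor - accepted\n";
      } else {
              print "$dev: major $major minor $minor - would be skipped\n";
      }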

Next issue :-)

I start up Heartbeat and it all looks good for 20 seconds... then it enters a cycle of alternately rebooting each machine.
While this is quite fun to watch, it's not really what I want.

On the node that stays up (nas1) I get;

Nov  6 21:35:01 nas1 crmd: [1501]: info: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Nov  6 21:35:01 nas1 attrd: [1500]: info: find_hash_entry: Creating hash entry for terminate
Nov  6 21:35:01 nas1 attrd: [1500]: info: find_hash_entry: Creating hash entry for shutdown
Nov  6 21:35:01 nas1 attrd: [1500]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Nov  6 21:35:01 nas1 attrd: [1500]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Nov  6 21:35:01 nas1 attrd: [1500]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Nov  6 21:35:01 nas1 attrd: [1500]: info: attrd_ha_callback: flush message from nas1
Nov  6 21:35:01 nas1 attrd: [1500]: info: attrd_ha_callback: flush message from nas1
Nov  6 21:35:02 nas1 crmd: [1501]: notice: crmd_client_status_callback: Status update: Client nas2/crmd now has status [offline] (DC=false)
Nov  6 21:35:02 nas1 crmd: [1501]: info: crm_update_peer_proc: nas2.crmd is now offline

While on the node that dies I get;

Nov  6 21:35:00 nas2 crmd: [1408]: info: te_connect_stonith: Connected
Nov  6 21:35:00 nas2 crmd: [1408]: info: config_query_callback: Checking for expired actions every 900000ms
Nov  6 21:35:00 nas2 crmd: [1408]: info: update_dc: Set DC to nas2 (3.0.1)
Nov  6 21:35:00 nas2 crmd: [1408]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Nov  6 21:35:00 nas2 crmd: [1408]: info: do_state_transition: All 2 cluster nodes responded to the join offer.
Nov  6 21:35:00 nas2 crmd: [1408]: info: do_dc_join_finalize: join-1: Syncing the CIB from nas2 to the rest of the cluster
Nov  6 21:35:00 nas2 cib: [1404]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/14, version=0.19.0): ok (rc=0)
Nov  6 21:35:00 nas2 cib: [1404]: ERROR: ID nas1 redefined
Nov  6 21:35:00 nas2 cib: [1404]: ERROR: Element node failed to validate attributes
Nov  6 21:35:00 nas2 cib: [1404]: ERROR: Expecting an element instance_attributes, got nothing
Nov  6 21:35:00 nas2 cib: [1404]: ERROR: Element nodes has extra content: node
Nov  6 21:35:00 nas2 cib: [1404]: ERROR: Invalid sequence in interleave
Nov  6 21:35:00 nas2 cib: [1404]: ERROR: Element cib failed to validate content
Nov  6 21:35:00 nas2 cib: [1404]: WARN: cib_perform_op: Updated CIB does not validate against pacemaker-1.0 schema/dtd
Nov  6 21:35:00 nas2 cib: [1404]: WARN: cib_diff_notify: Update (client: crmd, call:15): 0.19.0 -> 0.20.1 (Update does not conform to the configured schema/DTD)
Nov  6 21:35:00 nas2 cib: [1404]: WARN: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/15, version=0.19.0): Update does not conform to the configured schema/DTD (rc=-47)
Nov  6 21:35:00 nas2 cib: [1404]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/16, version=0.19.0): ok (rc=0)
Nov  6 21:35:00 nas2 crmd: [1408]: info: update_attrd: Connecting to attrd...
Nov  6 21:35:00 nas2 attrd: [1407]: info: find_hash_entry: Creating hash entry for terminate
Nov  6 21:35:00 nas2 attrd: [1407]: info: find_hash_entry: Creating hash entry for shutdown
Nov  6 21:35:00 nas2 cib: [1404]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='nas2']/transient_attributes (origin=local/crmd/17, version=0.19.0): ok (rc=0)
Nov  6 21:35:00 nas2 crmd: [1408]: info: erase_xpath_callback: Deletion of "//node_state[@uname='nas2']/transient_attributes": ok (rc=0)
Nov  6 21:35:00 nas2 cib: [1404]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='nas2']/lrm (origin=local/crmd/18, version=0.19.0): ok (rc=0)
Nov  6 21:35:00 nas2 crmd: [1408]: info: erase_xpath_callback: Deletion of "//node_state[@uname='nas2']/lrm": ok (rc=0)
Nov  6 21:35:00 nas2 crmd: [1408]: info: do_dc_join_ack: join-1: Updating node state to member for nas2
Nov  6 21:35:00 nas2 cib: [1404]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='nas2']/lrm (origin=local/crmd/19, version=0.19.0): ok (rc=0)
Nov  6 21:35:00 nas2 crmd: [1408]: info: erase_xpath_callback: Deletion of "//node_state[@uname='nas2']/lrm": ok (rc=0)
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_dc_join_ack: join-1: Updating node state to member for nas1
Nov  6 21:35:01 nas2 cib: [1404]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='nas1']/lrm (origin=local/crmd/21, version=0.19.1): ok (rc=0)
Nov  6 21:35:01 nas2 crmd: [1408]: info: erase_xpath_callback: Deletion of "//node_state[@uname='nas1']/lrm": ok (rc=0)
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Nov  6 21:35:01 nas2 crmd: [1408]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Nov  6 21:35:01 nas2 attrd: [1407]: info: attrd_ha_callback: flush message from nas1
Nov  6 21:35:01 nas2 cib: [1404]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='nas1']/transient_attributes (origin=nas1/crmd/7, version=0.19.2): ok (rc=0)
Nov  6 21:35:01 nas2 cib: [1404]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='nas1']/lrm (origin=nas1/crmd/8, version=0.19.3): ok (rc=0)
Nov  6 21:35:01 nas2 attrd: [1407]: info: attrd_ha_callback: flush message from nas1
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
Nov  6 21:35:01 nas2 crmd: [1408]: info: crm_update_quorum: Updating quorum status to true (call=25)
Nov  6 21:35:01 nas2 crmd: [1408]: info: abort_transition_graph: do_te_invoke:190 - Triggered transition abort (complete=1) : Peer Cancelled
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_pe_invoke: Query 26: Requesting the current CIB: S_POLICY_ENGINE
Nov  6 21:35:01 nas2 attrd: [1407]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Nov  6 21:35:01 nas2 attrd: [1407]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Nov  6 21:35:01 nas2 cib: [1404]: ERROR: ID nas1 redefined
Nov  6 21:35:01 nas2 cib: [1404]: ERROR: Element node failed to validate attributes
Nov  6 21:35:01 nas2 cib: [1404]: ERROR: Expecting an element instance_attributes, got nothing
Nov  6 21:35:01 nas2 cib: [1404]: ERROR: Element nodes has extra content: node
Nov  6 21:35:01 nas2 cib: [1404]: ERROR: Invalid sequence in interleave
Nov  6 21:35:01 nas2 cib: [1404]: ERROR: Element cib failed to validate content
Nov  6 21:35:01 nas2 cib: [1404]: WARN: cib_perform_op: Updated CIB does not validate against pacemaker-1.0 schema/dtd
Nov  6 21:35:01 nas2 cib: [1404]: WARN: cib_diff_notify: Update (client: crmd, call:23): 0.19.3 -> 0.20.1 (Update does not conform to the configured schema/DTD)
Nov  6 21:35:01 nas2 cib: [1404]: WARN: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/23, version=0.19.3): Update does not conform to the configured schema/DTD (rc=-47)
Nov  6 21:35:01 nas2 crmd: [1408]: ERROR: default_cib_update_callback: CIB Update failed: Update does not conform to the configured schema/DTD
Nov  6 21:35:01 nas2 crmd: [1408]: WARN: print_xml_formatted: default_cib_update_callback: update:failed: NULL
Nov  6 21:35:01 nas2 crmd: [1408]: ERROR: do_log: FSA: Input I_ERROR from default_cib_update_callback() received in state S_POLICY_ENGINE
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=default_cib_update_callback ]
Nov  6 21:35:01 nas2 crmd: [1408]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Nov  6 21:35:01 nas2 crmd: [1408]: WARN: do_election_vote: Not voting in election, we're in state S_RECOVERY
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_dc_release: DC role released
Nov  6 21:35:01 nas2 crmd: [1408]: info: stop_subsystem: Sent -TERM to pengine: [2316]
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_te_control: Transitioner is now inactive
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_te_control: Disconnecting STONITH...
Nov  6 21:35:01 nas2 crmd: [1408]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Nov  6 21:35:01 nas2 crmd: [1408]: notice: Not currently connected.
Nov  6 21:35:01 nas2 crmd: [1408]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Nov  6 21:35:01 nas2 pengine: [2316]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_shutdown: Terminating the pengine
Nov  6 21:35:01 nas2 crmd: [1408]: info: stop_subsystem: Sent -TERM to pengine: [2316]
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_shutdown: Waiting for subsystems to exit
Nov  6 21:35:01 nas2 crmd: [1408]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_shutdown: All subsystems stopped, continuing
Nov  6 21:35:01 nas2 crmd: [1408]: WARN: do_log: FSA: Input I_PENDING from do_election_vote() received in state S_TERMINATE
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_shutdown: Terminating the pengine
Nov  6 21:35:01 nas2 crmd: [1408]: info: stop_subsystem: Sent -TERM to pengine: [2316]
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_shutdown: Waiting for subsystems to exit
Nov  6 21:35:01 nas2 crmd: [1408]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_shutdown: All subsystems stopped, continuing
Nov  6 21:35:01 nas2 crmd: [1408]: info: crmdManagedChildDied: Process pengine:[2316] exited (signal=0, exitcode=0)
Nov  6 21:35:01 nas2 crmd: [1408]: info: pe_msg_dispatch: Received HUP from pengine:[2316]
Nov  6 21:35:01 nas2 crmd: [1408]: info: pe_connection_destroy: Connection to the Policy Engine released
Nov  6 21:35:01 nas2 crmd: [1408]: WARN: do_log: FSA: Input I_RELEASE_SUCCESS from do_dc_release() received in state S_TERMINATE
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_shutdown: All subsystems stopped, continuing
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_lrm_control: Disconnected from the LRM
Nov  6 21:35:01 nas2 ccm: [1403]: info: client (pid=1408) removed from ccm
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_ha_control: Disconnected from Heartbeat
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_cib_control: Disconnecting CIB
Nov  6 21:35:01 nas2 crmd: [1408]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Nov  6 21:35:01 nas2 crmd: [1408]: ERROR: do_exit: Could not recover from internal error
Nov  6 21:35:01 nas2 crmd: [1408]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Nov  6 21:35:01 nas2 crmd: [1408]: info: do_exit: [crmd] stopped (2)
Nov  6 21:35:01 nas2 cib: [1404]: WARN: send_ipc_message: IPC Channel to 1408 is not connected
Nov  6 21:35:01 nas2 heartbeat: [1321]: WARN: Managed /usr/lib/heartbeat/crmd process 1408 exited with return code 2.
Nov  6 21:35:01 nas2 heartbeat: [1321]: EMERG: Rebooting system.  Reason: /usr/lib/heartbeat/crmd
Nov  6 21:35:01 nas2 cib: [1404]: WARN: send_via_callback_channel: Delivery of reply to client 1408/f145a72e-408f-48fa-a288-c9a6dee34439 failed
Nov  6 21:35:01 nas2 cib: [1404]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Nov  6 21:35:01 nas2 cib: [1404]: WARN: send_ipc_message: IPC Channel to 1408 is not connected
Nov  6 21:35:01 nas2 cib: [1404]: WARN: cib_notify_client: Notification of client 1408/f145a72e-408f-48fa-a288-c9a6dee34439 failed
Nov  6 21:35:01 nas2 cib: [1404]: info: log_data_element: cib:diff: - <cib admin_epoch="0" epoch="19" num_updates="3" />
Nov  6 21:35:01 nas2 cib: [1404]: info: log_data_element: cib:diff: + <cib dc-uuid="db80a2a6-fa1e-47d5-a7ab-97eb81abeb84" admin_epoch="0" epoch="20" num_updates="1" />
Nov  6 21:35:01 nas2 cib: [1404]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/25, version=0.20.1): ok (rc=0)
Nov  6 21:35:01 nas2 cib: [1404]: WARN: send_ipc_message: IPC Channel to 1408 is not connected
Nov  6 21:35:01 nas2 cib: [1404]: WARN: send_via_callback_channel: Delivery of reply to client 1408/f145a72e-408f-48fa-a288-c9a6dee34439 failed
Nov  6 21:35:01 nas2 cib: [1404]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Nov  6 21:35:01 nas2 cib: [1404]: WARN: send_ipc_message: IPC Channel to 1408 is not connected
Nov  6 21:35:01 nas2 cib: [1404]: WARN: send_via_callback_channel: Delivery of reply to client 1408/f145a72e-408f-48fa-a288-c9a6dee34439 failed
Nov  6 21:35:01 nas2 cib: [1404]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Nov  6 21:35:01 nas2 cib: [1404]: info: cib_process_readwrite: We are now in R/O mode
Nov  6 21:35:01 nas2 cib: [1404]: WARN: send_ipc_message: IPC Channel to 1408 is not connected
Nov  6 21:35:01 nas2 cib: [1404]: WARN: send_via_callback_channel: Delivery of reply to client 1408/f145a72e-408f-48fa-a288-c9a6dee34439 failed
Nov  6 21:35:01 nas2 cib: [1404]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Nov  6 21:35:01 nas2 attrd: [1407]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Nov  6 21:35:01 nas2 cib: [2408]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-18.raw
Nov  6 21:35:02 nas2 cib: [2408]: info: write_cib_contents: Wrote version 0.20.0 of the CIB to disk (digest: 589963f79119c376fdddc5b27879c7d0)
Nov  6 21:35:02 nas2 cib: [2408]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.zJj2F4 (digest: /var/lib/heartbeat/crm/cib.trDQfP)

I've no clue where to start with this lot; anyone got any ideas?

This eventually ends in split brain and both boxes fail to start.
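
Presumably the "ID nas1 redefined" errors mean the <nodes> section of the CIB has ended up with two <node> entries carrying id="nas1" (I'm guessing here, I haven't pulled the XML apart), something along the lines of;

      <nodes>
        <node id="nas1" uname="nas1" type="normal"/>
        <node id="nas1" uname="nas1" type="normal"/>
        <node id="nas2" uname="nas2" type="normal"/>
      </nodes>

"cibadmin -Q -o nodes" on the surviving node should show whether that's actually what's in there.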

Given I've just set this up using the GUI and mostly used the defaults, I'm not entirely sure what I can do except go in underneath the GUI and hand-edit things... which defeats the object of the exercise.

The boxes have been running DRBD from a stand-alone, hand-edited config and sharing vblade (AoE) between them quite happily, so I'm reasonably confident the boxes themselves are OK / properly configured...

???

Gareth.


----- Original Message -----
From: "Rasto Levrinc" <rasto.levrinc at linbit.com>
To: drbd-mc at lists.linbit.com
Sent: Friday, 6 November, 2009 7:08:13 PM
Subject: Re: [drbd-mc] DRBD UI device issues ...

On Fri, November 6, 2009 6:47 pm, Gareth Bult wrote:
> Hi,
>
> Can anyone tell me how I can get it to recognise;
>
> /dev/md_d0
>
> (which apparently is the new / standard naming convention for MD devices)

Can you tell me what cat /proc/mdstat says on your system?
For now you can try to change this line:

 if (/^(md\d+)\s+:\s+(.+)/) {

in /usr/local/bin/drbd-gui-helper on all nodes to something like:

 if (/^(md_d\d+)\s+:\s+(.+)/) {
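
(i.e. so the regexp matches the md_d-style lines in /proc/mdstat, which look something like;

 md_d0 : active raid1 sdb1[1] sda1[0]

rather than the old "md0 : ..." form.)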

Then you have to start the DMC with the --keep-helper option, so the
drbd-gui-helper doesn't get overwritten.
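
For example (the jar name here is just a placeholder; use whatever your DMC jar is actually called);

 java -jar DMC.jar --keep-helper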

Rasto


-- 
: Dipl-Ing Rastislav Levrinc
: DRBD-MC http://www.drbd.org/mc/management-console/
: DRBD/HA support and consulting http://www.linbit.com/
DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.


_______________________________________________
drbd-mc mailing list
drbd-mc at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-mc

-- 
Gareth Bult (Gareth at Bult.co.uk)

