Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

I have a nasty problem with my cluster. Every now and then DRBD fails with "Digest integrity check FAILED". If I understand this correctly, that by itself is harmless: DRBD drops the connection and reconnects at once. However, before it can reconnect, the cluster fences the secondary node, and the resulting constraint makes any failover impossible until I manually clear the fencing rule out of the crm configuration.

The log looks like this:

Nov 17 18:30:52 srv1 kernel: [2299058.247328] block drbd1: Digest integrity check FAILED.
Nov 17 18:30:52 srv1 kernel: [2299058.265232] block drbd1: error receiving Data, l: 4124!
Nov 17 18:30:52 srv1 kernel: [2299058.282986] block drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Nov 17 18:30:52 srv1 kernel: [2299058.283046] block drbd1: asender terminated
Nov 17 18:30:52 srv1 kernel: [2299058.283054] block drbd1: Terminating drbd1_asender
Nov 17 18:30:52 srv1 kernel: [2299058.321615] block drbd1: Connection closed
Nov 17 18:30:52 srv1 kernel: [2299058.321622] block drbd1: conn( ProtocolError -> Unconnected )
Nov 17 18:30:52 srv1 kernel: [2299058.321629] block drbd1: receiver terminated
Nov 17 18:30:52 srv1 kernel: [2299058.321632] block drbd1: Restarting drbd1_receiver
Nov 17 18:30:52 srv1 kernel: [2299058.321636] block drbd1: receiver (re)started
Nov 17 18:30:52 srv1 kernel: [2299058.321641] block drbd1: conn( Unconnected -> WFConnection )
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: - <cib admin_epoch="0" epoch="3636" num_updates="19" />
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + <cib admin_epoch="0" epoch="3637" num_updates="1" >
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + <configuration >
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + <constraints >
Nov 17 18:30:52 srv1 crmd: [1363]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + <rsc_location rsc="ms-drbd-r1" id="drbd-fence-by-handler-ms-drbd-r1" __crm_diff_marker__="added:top" >
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-rule-ms-drbd-r1" >
Nov 17 18:30:52 srv1 crmd: [1363]: info: need_abort: Aborting on change to admin_epoch
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + <expression attribute="#uname" operation="ne" value="server2" id="drbd-fence-by-handler-expr-ms-drbd-r1" />
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + </rule>
Nov 17 18:30:52 srv1 crmd: [1363]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + </rsc_location>
Nov 17 18:30:52 srv1 crmd: [1363]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + </constraints>
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + </configuration>
Nov 17 18:30:52 srv1 crmd: [1363]: info: do_pe_invoke: Query 3890: Requesting the current CIB: S_POLICY_ENGINE
Nov 17 18:30:52 srv1 cib: [1359]: info: log_data_element: cib:diff: + </cib>
Nov 17 18:30:52 srv1 cib: [1359]: info: cib_process_request: Operation complete: op cib_create for section constraints (origin=server2/cibadmin/2, version=0.3637.1): ok (rc=0)
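Judging by the constraint id in the log ("drbd-fence-by-handler-ms-drbd-r1"), the rule is being added by the crm-fence-peer.sh handler. The relevant part of my DRBD config looks roughly like this (typed from memory, so details such as the resource name and digest algorithm may be slightly off; the handler paths are the ones the drbd package installs):

    resource r1 {
      net {
        data-integrity-alg md5;    # the digest whose check is failing
      }
      disk {
        fencing resource-only;
      }
      handlers {
        # adds the drbd-fence-by-handler-* constraint to the CIB
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # supposed to remove the constraint again after a successful resync
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }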
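So far the only way I have found to get the cluster going again is to delete the constraint by hand with the crm shell, the id being the one from the log above:

    crm configure delete drbd-fence-by-handler-ms-drbd-r1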
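One idea I had is to point fence-peer at a small wrapper that gives DRBD a few seconds to reconnect before the real handler gets to create the constraint. An untested sketch (DRBD_RESOURCE is set in the handler's environment when drbdadm invokes it):

    #!/bin/bash
    # Untested sketch: wait briefly for the replication link to recover
    # before escalating to the real fence-peer handler.
    for i in 1 2 3 4 5; do
        [ "$(drbdadm cstate "$DRBD_RESOURCE")" = "Connected" ] && break
        sleep 1
    done
    # Hand over to the real handler either way and let it decide;
    # its exit code is passed back to DRBD unchanged.
    exec /usr/lib/drbd/crm-fence-peer.sh "$@"

But I do not know whether delaying the handler like this is safe.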
What can I do to fight this? I have no idea why the communication sometimes fails; the NICs and the cabling look perfect. However, I have read on the mailing lists that this can happen with certain NIC/kernel combinations. Can we force the cluster software to wait a little for the reconnect before it fences?

Thanks,
Dmitry