Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

Please help me understand what is causing this problem. I have a 2-node cluster running on KVM VMs; each VM (Ubuntu 14.04) runs on a separate hypervisor on a separate physical machine. Everything worked well during testing (I restarted the VMs alternately), but after about a day, when I kill one node, corosync and pacemaker always hang on the surviving node because DRBD times out. Date and time on the VMs are in sync, I use unicast, tcpdump shows traffic exchanged between both nodes, and I confirmed that DRBD was healthy and crm_mon showed a good status before I killed the other node.

Below are the versions and configurations I am using:

corosync              2.3.3-1ubuntu1
crmsh                 1.2.5+hg1034-1ubuntu3
drbd8-utils           2:8.4.4-1ubuntu1
libcorosync-common4   2.3.3-1ubuntu1
libcrmcluster4        1.1.10+git20130802-1ubuntu2
libcrmcommon3         1.1.10+git20130802-1ubuntu2
libcrmservice1        1.1.10+git20130802-1ubuntu2
pacemaker             1.1.10+git20130802-1ubuntu2
pacemaker-cli-utils   1.1.10+git20130802-1ubuntu2
postgresql-9.3        9.3.5-0ubuntu0.14.04.1

# /etc/corosync/corosync.conf:
totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: none
    transport: udpu
    interface {
        member {
            memberaddr: 10.2.136.56
        }
        member {
            memberaddr: 10.2.136.57
        }
        ringnumber: 0
        bindnetaddr: 10.2.136.0
        mcastport: 5405
    }
}

amf {
    mode: disabled
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}

aisexec {
    user: root
    group: root
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}

# /etc/corosync/service.d/pcmk:
service {
    name: pacemaker
    ver: 1
}

# /etc/drbd.d/global_common.conf:
global {
    usage-count no;
}
common {
    net {
        protocol C;
    }
}

# /etc/drbd.d/pg.res:
resource pg {
    device /dev/drbd0;
    disk /dev/vdb;
    meta-disk internal;

    disk {
        fencing resource-only;
        on-io-error detach;
        resync-rate 40M;
    }

    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    on node01 {
        address 10.2.136.56:7789;
    }
    on node02 {
        address 10.2.136.57:7789;
    }

    net {
        verify-alg md5;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
}

# Pacemaker configuration:
node $id="167938104" node01
node $id="167938105" node02
primitive drbd_pg ocf:linbit:drbd \
    params drbd_resource="pg" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
primitive fs_pg ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
primitive ip_pg ocf:heartbeat:IPaddr2 \
    params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
primitive lsb_pg lsb:postgresql
group PGServer fs_pg lsb_pg ip_pg
ms ms_drbd_pg drbd_pg \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-42f2063" \
    cluster-infrastructure="corosync" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"
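One thing I noticed is that my drbd_pg primitive sets no operation timeouts, so the monitor runs with the 20s lrmd default, which is exactly what times out in the logs below. This is a sketch of what I am considering instead; the timeout values are guesses on my part, not something I have tested:

primitive drbd_pg ocf:linbit:drbd \
    params drbd_resource="pg" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s" \
    op monitor interval="29s" role="Master" timeout="60s" \
    op monitor interval="31s" role="Slave" timeout="60s"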
# Logs on node01:
node01 crmd[1019]: notice: peer_update_callback: Our peer on the DC is dead
node01 crmd[1019]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
node01 crmd[1019]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
node01 corosync[940]: [TOTEM ] A new membership (10.2.136.56:52) was formed. Members left: 167938105
node01 kernel: [74452.740024] d-con pg: PingAck did not arrive in time.
node01 kernel: [74452.740169] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
node01 kernel: [74452.740987] d-con pg: asender terminated
node01 kernel: [74452.740999] d-con pg: Terminating drbd_a_pg
node01 kernel: [74452.741235] d-con pg: Connection closed
node01 kernel: [74452.741259] d-con pg: conn( NetworkFailure -> Unconnected )
node01 kernel: [74452.741260] d-con pg: receiver terminated
node01 kernel: [74452.741261] d-con pg: Restarting receiver thread
node01 kernel: [74452.741262] d-con pg: receiver (re)started
node01 kernel: [74452.741269] d-con pg: conn( Unconnected -> WFConnection )
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
node01 crmd[1019]: error: process_lrm_event: LRM operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
node01 crmd[1019]: warning: cib_rsc_callback: Resource update 23 failed: (rc=-62) Timer expired
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8693) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8693 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8938) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8938 - timed out after 20000ms
node01 crmd[1019]: error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
node01 crmd[1019]: warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
node01 crmd[1019]: warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.
node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node02=none
node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node01=welcomed
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms
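If I understand the fencing handler correctly, because I have "fencing resource-only", the connection loss makes DRBD call crm-fence-peer.sh, which tries to add a location constraint to the CIB roughly like the one below (the exact constraint id is generated by the script; this is my reconstruction, not taken from my cluster). My guess is that this CIB update cannot complete while the surviving crmd is still stuck in election/integration, which might also explain why the monitor operations keep timing out:

location drbd-fence-by-handler-pg-ms_drbd_pg ms_drbd_pg \
    rule $role="Master" -inf: #uname ne node01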
# crm_mon on node01 before I kill the other vm:
Stack: corosync
Current DC: node02 (167938104) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
5 Resources configured

Online: [ node01 node02 ]

 Resource Group: PGServer
     fs_pg      (ocf::heartbeat:Filesystem):    Started node02
     lsb_pg     (lsb:postgresql):               Started node02
     ip_pg      (ocf::heartbeat:IPaddr2):       Started node02
 Master/Slave Set: ms_drbd_pg [drbd_pg]
     Masters: [ node02 ]
     Slaves: [ node01 ]

Thank you,
Kiam
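P.S. I am also wondering whether running with stonith-enabled="false" and no-quorum-policy="ignore" is part of the problem. Since the VMs run under KVM, something like the sketch below is what I would try for real fencing via the hypervisors (untested; the hypervisor URIs are placeholders for my actual hosts):

primitive st_node01 stonith:external/libvirt \
    params hostlist="node01" hypervisor_uri="qemu+ssh://hv1.example.com/system" \
    op monitor interval="60s"
primitive st_node02 stonith:external/libvirt \
    params hostlist="node02" hypervisor_uri="qemu+ssh://hv2.example.com/system" \
    op monitor interval="60s"
location st_node01_not_on_node01 st_node01 -inf: node01
location st_node02_not_on_node02 st_node02 -inf: node02
property stonith-enabled="true"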