Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

Please help me understand what is causing this problem. I have a 2-node cluster running on KVM VMs; each VM (Ubuntu 14.04) runs on a separate hypervisor on a separate physical machine. Everything worked well during testing (I restarted the VMs alternately), but after about a day, when I kill one node, corosync and pacemaker always hang on the surviving node because DRBD times out. Date and time on the VMs are in sync, I use unicast, tcpdump shows traffic exchanged between both nodes, and I confirmed that DRBD was healthy and crm_mon showed a good status before I killed the other node.

Below are the versions and configurations I am using:

corosync              2.3.3-1ubuntu1
crmsh                 1.2.5+hg1034-1ubuntu3
drbd8-utils           2:8.4.4-1ubuntu1
libcorosync-common4   2.3.3-1ubuntu1
libcrmcluster4        1.1.10+git20130802-1ubuntu2
libcrmcommon3         1.1.10+git20130802-1ubuntu2
libcrmservice1        1.1.10+git20130802-1ubuntu2
pacemaker             1.1.10+git20130802-1ubuntu2
pacemaker-cli-utils   1.1.10+git20130802-1ubuntu2
postgresql-9.3        9.3.5-0ubuntu0.14.04.1

# /etc/corosync/corosync.conf:
totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: none
    transport: udpu
    interface {
        member {
            memberaddr: 10.2.136.56
        }
        member {
            memberaddr: 10.2.136.57
        }
        ringnumber: 0
        bindnetaddr: 10.2.136.0
        mcastport: 5405
    }
}

amf {
    mode: disabled
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}

aisexec {
    user: root
    group: root
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}

# /etc/corosync/service.d/pcmk:
service {
    name: pacemaker
    ver: 1
}

# /etc/drbd.d/global_common.conf:
global {
    usage-count no;
}
common {
    net {
        protocol C;
    }
}

# /etc/drbd.d/pg.res:
resource pg {
    device /dev/drbd0;
    disk /dev/vdb;
    meta-disk internal;

    disk {
        fencing resource-only;
        on-io-error detach;
        resync-rate 40M;
    }

    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    on node01 {
        address 10.2.136.56:7789;
    }
    on node02 {
        address 10.2.136.57:7789;
    }

    net {
        verify-alg md5;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
}

# Pacemaker configuration:
node $id="167938104" node01
node $id="167938105" node02
primitive drbd_pg ocf:linbit:drbd \
    params drbd_resource="pg" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
primitive fs_pg ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
primitive ip_pg ocf:heartbeat:IPaddr2 \
    params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
primitive lsb_pg lsb:postgresql
group PGServer fs_pg lsb_pg ip_pg
ms ms_drbd_pg drbd_pg \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-42f2063" \
    cluster-infrastructure="corosync" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"
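One thing I noticed is that my drbd_pg primitive sets no operation timeouts, so the monitor runs with the 20s lrmd default, which is exactly what times out in the logs below. This is a sketch of what I am considering instead; the timeout values are guesses on my part, not something I have tested:

primitive drbd_pg ocf:linbit:drbd \
    params drbd_resource="pg" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s" \
    op monitor interval="29s" role="Master" timeout="60s" \
    op monitor interval="31s" role="Slave" timeout="60s"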
# Logs on node01:
node01 crmd[1019]: notice: peer_update_callback: Our peer on the DC is dead
node01 crmd[1019]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
node01 crmd[1019]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
node01 corosync[940]: [TOTEM ] A new membership (10.2.136.56:52) was formed. Members left: 167938105
node01 kernel: [74452.740024] d-con pg: PingAck did not arrive in time.
node01 kernel: [74452.740169] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
node01 kernel: [74452.740987] d-con pg: asender terminated
node01 kernel: [74452.740999] d-con pg: Terminating drbd_a_pg
node01 kernel: [74452.741235] d-con pg: Connection closed
node01 kernel: [74452.741259] d-con pg: conn( NetworkFailure -> Unconnected )
node01 kernel: [74452.741260] d-con pg: receiver terminated
node01 kernel: [74452.741261] d-con pg: Restarting receiver thread
node01 kernel: [74452.741262] d-con pg: receiver (re)started
node01 kernel: [74452.741269] d-con pg: conn( Unconnected -> WFConnection )
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
node01 crmd[1019]: error: process_lrm_event: LRM operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
node01 crmd[1019]: warning: cib_rsc_callback: Resource update 23 failed: (rc=-62) Timer expired
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8693) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8693 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8938) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8938 - timed out after 20000ms
node01 crmd[1019]: error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
node01 crmd[1019]: warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
node01 crmd[1019]: warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.
node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node02=none
node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node01=welcomed
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms
node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out
node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms
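If I understand the fencing handler correctly, because I have "fencing resource-only", the connection loss makes DRBD call crm-fence-peer.sh, which tries to add a location constraint to the CIB roughly like the one below (the exact constraint id is generated by the script; this is my reconstruction, not taken from my cluster). My guess is that this CIB update cannot complete while the surviving crmd is still stuck in election/integration, which might also explain why the monitor operations keep timing out:

location drbd-fence-by-handler-pg-ms_drbd_pg ms_drbd_pg \
    rule $role="Master" -inf: #uname ne node01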
# crm_mon on node01 before I kill the other vm:
Stack: corosync
Current DC: node02 (167938104) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
5 Resources configured

Online: [ node01 node02 ]

 Resource Group: PGServer
     fs_pg      (ocf::heartbeat:Filesystem):    Started node02
     lsb_pg     (lsb:postgresql):               Started node02
     ip_pg      (ocf::heartbeat:IPaddr2):       Started node02
 Master/Slave Set: ms_drbd_pg [drbd_pg]
     Masters: [ node02 ]
     Slaves: [ node01 ]

Thank you,
Kiam
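P.S. I am also wondering whether running with stonith-enabled="false" and no-quorum-policy="ignore" is part of the problem. Since the VMs run under KVM, something like the sketch below is what I would try for real fencing via the hypervisors (untested; the hypervisor URIs are placeholders for my actual hosts):

primitive st_node01 stonith:external/libvirt \
    params hostlist="node01" hypervisor_uri="qemu+ssh://hv1.example.com/system" \
    op monitor interval="60s"
primitive st_node02 stonith:external/libvirt \
    params hostlist="node02" hypervisor_uri="qemu+ssh://hv2.example.com/system" \
    op monitor interval="60s"
location st_node01_not_on_node01 st_node01 -inf: node01
location st_node02_not_on_node02 st_node02 -inf: node02
property stonith-enabled="true"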