[DRBD-user] Strange alerts on zabbix showing pacemaker down

Mon Jul 9 01:17:13 CEST 2012

Hi DRBD Folks,
I have a strange issue where the zabbix checks for drbdadm/pacemaker
alert at random intervals. This all started after doing a test failover
of the master node using drbd.

Some of the failing checks executed by zabbix are:

COMMAND=/sbin/drbdadm dstate harddisk
COMMAND=/sbin/drbdadm cstate ssddisk
COMMAND=/usr/sbin/crm_mon -s
COMMAND=/usr/sbin/crm_mon -1
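
One way to tell whether the check commands themselves are slow, or only the ssh login in front of them, would be to time them locally on the node. The following is a minimal sketch, not what zabbix does: the helper name time_check and the 10-second limit are assumptions for illustration, and the commands would need to run as root (or via sudo) to return useful results.

```python
import subprocess
import time

# The check commands quoted above. The 10-second limit below is an assumed
# value, not the timeout zabbix is actually configured with.
CHECKS = [
    ["/sbin/drbdadm", "dstate", "harddisk"],
    ["/sbin/drbdadm", "cstate", "ssddisk"],
    ["/usr/sbin/crm_mon", "-s"],
    ["/usr/sbin/crm_mon", "-1"],
]


def time_check(cmd, timeout=10.0):
    """Run one command; return (status, seconds).

    status is the exit code, "timeout" if the limit was hit,
    or "not-run" if the command could not be started at all.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=timeout)
        status = proc.returncode
    except subprocess.TimeoutExpired:
        status = "timeout"
    except OSError:
        status = "not-run"
    return status, time.monotonic() - start


if __name__ == "__main__":
    for cmd in CHECKS:
        status, elapsed = time_check(cmd)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} {' '.join(cmd)}: {status} in {elapsed:.2f}s")
```

Run from cron every minute with the output appended to a file, this would give timestamps to line up against the zabbix alerts and the sar data.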

At first I thought this was a zabbix-only problem, but after a few dozen
alerts in the middle of the night with no load on the system I began to
suspect that something else was going awry.
During an event where the pacemaker/drbdadm checks time out, I am unable
to log into the systems in a timely manner.
I have attempted to log into the mysql server during an alerting event to
see what may be causing the blocking, but logging in takes 2-5 minutes,
which seems off for a server with LoadAvg in the 0.0[1-9] range and
iostat -dx showing the disks not over capacity (I checked as soon as I
was able to log in).

I turned on sar on the server to get better data and found 2 other things
occurring at exactly the same time: a spike in totsck and one of the cores
showing high CPU utilization. Normally totsck is in the 500 range, but
during an event it is in the 1500 range.

               totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
10:45:01 AM      1561       293        18         0         0       838

10:35:01 AM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest     %idle
10:45:01 AM       5     11.64      0.00     43.87      0.02      0.00      0.00      0.04      0.00     44.42

06:45:01 AM    totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
03:05:06 AM      1562       286        17         0         0       859
03:15:01 AM      1548       286        17         0         0       869

10:35:01 AM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest     %idle
03:15:01 AM       6     20.88      0.00     79.09      0.00      0.00      0.00      0.03      0.00      0.00
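
To pull the totsck spikes out of a longer sar report without reading it by eye, something like the following sketch could be used. It assumes the 12-hour "AM/PM" timestamp layout shown above and uses the roughly-500 normal level mentioned earlier as the baseline; the function names and the factor of 2 are hypothetical choices, not part of sar.

```python
# Two socket rows copied from the sar output above, with their header line.
SAMPLE = """\
06:45:01 AM    totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
03:05:06 AM      1562       286        17         0         0       859
03:15:01 AM      1548       286        17         0         0       869
"""


def parse_sock_rows(text):
    """Return [(timestamp, totsck), ...] for the data rows of socket output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has time, AM/PM and six numeric columns; header rows
        # carry the column name "totsck" in the third field and are skipped.
        if len(fields) == 8 and fields[1] in ("AM", "PM") and fields[2].isdigit():
            rows.append((f"{fields[0]} {fields[1]}", int(fields[2])))
    return rows


def find_spikes(rows, baseline=500, factor=2.0):
    """Return the rows whose totsck is at least factor times the baseline."""
    return [(ts, n) for ts, n in rows if n >= baseline * factor]


if __name__ == "__main__":
    for ts, n in find_spikes(parse_sock_rows(SAMPLE)):
        print(f"{ts}: totsck={n}")
```

Both sample rows are flagged, since 1562 and 1548 are well above twice the assumed baseline of 500.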

It is clear that something is happening on the server during these events.
The following entries also always appear in syslog at the same time
(although the same entries also appear when the zabbix check failures and
login delays do not occur):

Jul  8 03:05:44 mysql-1 lrmd: [2834]: info: operation monitor[191] on ip1 for client 2837: pid 7573 exited with return code 0
Jul  8 03:08:00 mysql-1 crmd: [2837]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke: Query 867: Requesting the current CIB: S_POLICY_ENGINE
Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke_callback: Invoking the PE: query=867, ref=pe_calc-dc-1341731280-1029, seq=32, quorate=1
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-2
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-1
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_mysql#011(Started mysql-1)
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_binlog#011(Started mysql-1)
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1#011(Started mysql-1)
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave mysql#011(Started mysql-1)
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1arp#011(Started mysql-1)
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:0#011(Slave mysql-2)
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:1#011(Master mysql-1)
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:0#011(Slave mysql-2)
Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:1#011(Master mysql-1)
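
The "(900000ms)" in the crm_timer_popped entry is the period of the Pacemaker recheck timer, which as far as I know is governed by the cluster-recheck-interval property. Converting it to minutes shows how often this block of entries is expected to repeat on its own, which may explain why the same entries also show up when nothing is wrong. A small sketch (the function name is hypothetical, and the log line is the one quoted above, joined onto one line):

```python
import re

# Matches the recheck-timer entry and captures the interval in milliseconds.
TIMER_RE = re.compile(
    r"PEngine Recheck Timer \(I_PE_CALC\) just popped \((\d+)ms\)"
)

LINE = ("Jul  8 03:08:00 mysql-1 crmd: [2837]: info: crm_timer_popped: "
        "PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)")


def recheck_interval_minutes(log_line):
    """Return the recheck interval in minutes, or None if the line has none."""
    match = TIMER_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) / 60000


if __name__ == "__main__":
    print(recheck_interval_minutes(LINE))  # 15.0
```

So the policy engine run logged above should recur every 15 minutes regardless of load, and only some of those runs would coincide with an alert.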

I know these events are only "notice" and "info", but they always occur when
zabbix alerts and we are unable to log in. The zabbix checks mentioned above
ssh into the mysql server to run the checks, so their timeouts are related
to my being unable to log in via ssh to the server at the same time.

The thing that is very suspect in all of this is that we never had a
problem until we did a test failover. It was only after that failover that
we started seeing issues.
If anyone else has seen something similar I would be grateful for some
insight.