Hi DRBD Folks
I have a strange issue where zabbix checks for drbdadm/pacemaker are
alerting at random intervals. This all started after doing a test failover
of the master node using drbd.
Some of the failing checks executed by zabbix are:
COMMAND=/sbin/drbdadm dstate harddisk
COMMAND=/sbin/drbdadm cstate ssddisk
COMMAND=/usr/sbin/crm_mon -s
COMMAND=/usr/sbin/crm_mon -1
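To tell whether the checks themselves stall or only the ssh login does, the
same commands could be timed locally on the node. This is only a minimal
sketch, assuming a POSIX-like shell with "local" support (bash, dash); the
helper name time_check and the "run" argument are hypothetical and not part
of zabbix, drbd or pacemaker:

```shell
#!/bin/bash
# Hypothetical helper: run one check, discard its output, and print
# how long it took (whole seconds) together with its exit code.
time_check() {
    local start end rc
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "$(date '+%F %T') elapsed=$((end - start))s rc=${rc} cmd=$*"
}

# Only run the real checks when asked to, e.g. "./time_checks.sh run".
# The commands are the same ones zabbix executes (see the list above).
if [ "${1:-}" = "run" ]; then
    time_check /sbin/drbdadm dstate harddisk
    time_check /sbin/drbdadm cstate ssddisk
    time_check /usr/sbin/crm_mon -s
    time_check /usr/sbin/crm_mon -1
fi
```

Run from cron every minute with the output appended to a file, this would
show whether drbdadm and crm_mon are slow on the node itself during an event.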
At first I thought that this was a zabbix-only problem, but after a few
dozen alerts in the middle of the night with no load on the system I began
to suspect that something else was going awry.
During an event where the pacemaker and drbdadm checks time out, I was
unable to log into the systems in a timely manner.
I have attempted to log into the mysql server during an alerting event to
see what may be causing this blocking, but I noticed that it takes 2-5
mins to log into the server, which seemed off for a server with LoadAvg in
the 0.0[1-9] range and iostat -dx showing the disks were not over capacity.
(I checked as soon as I was able to log in.)
I turned on sar on the server to get better data and found 2 other things
occurring at exactly the same time: a spike in totsck and one of the cores
having high cpu utilization. Normally totsck was in the 500 range, but
during an event it was in the 1500 range.
                totsck  tcpsck  udpsck  rawsck  ip-frag  tcp-tw
10:45:01 AM       1561     293      18       0        0     838

10:35:01 AM  CPU   %usr  %nice   %sys  %iowait  %steal   %irq  %soft  %guest  %idle
10:45:01 AM    5  11.64   0.00  43.87     0.02    0.00   0.00   0.04    0.00  44.42

06:45:01 AM     totsck  tcpsck  udpsck  rawsck  ip-frag  tcp-tw
03:05:06 AM       1562     286      17       0        0     859
03:15:01 AM       1548     286      17       0        0     869

10:35:01 AM  CPU   %usr  %nice   %sys  %iowait  %steal   %irq  %soft  %guest  %idle
03:15:01 AM    6  20.88   0.00  79.09     0.00    0.00   0.00   0.03    0.00   0.00
It is clear that something is occurring on the server when this happens.
The following events also always appear in syslog at the same time
(although the same events also occur when the zabbix check failures and
the inability to log in do not appear to occur):
Jul 8 03:05:44 mysql-1 lrmd: [2834]: info: operation monitor[191] on ip1 for client 2837: pid 7573 exited with return code 0
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke: Query 867: Requesting the current CIB: S_POLICY_ENGINE
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke_callback: Invoking the PE: query=867, ref=pe_calc-dc-1341731280-1029, seq=32, quorate=1
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-2
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-1
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_mysql#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_binlog#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave mysql#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1arp#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:0#011(Slave mysql-2)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:1#011(Master mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:0#011(Slave mysql-2)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:1#011(Master mysql-1)
I know these events are only "notice" and "info", but they always occur
when zabbix alerts and we are unable to log in. I know that the
aforementioned zabbix checks ssh into mysql and run the checks, so their
timeouts are related to my timeout issues, where at the same time I cannot
log in to the server via ssh.
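One way to check how closely the syslog entries line up with the alerts
would be to list every time the PEngine recheck timer fired and compare
those times against the zabbix alert times. A rough sketch, assuming the
usual "Mon D HH:MM:SS host ..." syslog line layout; the helper name
recheck_times is hypothetical and the log path in the usage comment is an
assumption:

```shell
# Hypothetical helper: print the timestamp (month, day, time) of every
# "PEngine Recheck Timer" line found in the given syslog file.
recheck_times() {
    grep 'PEngine Recheck Timer' "$1" | awk '{ print $1, $2, $3 }'
}

# Usage (log path is an assumption, adjust to the local syslog file):
#   recheck_times /var/log/syslog
```

Since the timer pops every 900000ms (15 minutes), alerts that do not fall on
that 15 minute grid would suggest the recheck itself is not the trigger.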
The thing that is very suspect in all of this is that we never had a
problem until we did a test failover. It was only after that failover that
we started seeing issues.
If anyone else has seen something similar I would be grateful for some
insight.