[DRBD-user] Strange alerts on zabbix showing pacemaker down

Tue Jul 10 00:47:40 CEST 2012

On 07/09/2012 01:17 AM, Richard Goetz wrote:
> Hi DRBD Folks
> I have a strange issue where the zabbix checks for drbdadm/pacemaker
> are alerting at random intervals. This all started after doing a test
> failover of the master node using drbd.
> 
> Some of the failing checks executed by zabbix are:
> 
> COMMAND=/sbin/drbdadm dstate harddisk 
> COMMAND=/sbin/drbdadm cstate ssddisk
> COMMAND= /usr/sbin/crm_mon -s
> COMMAND= /usr/sbin/crm_mon -1
> 
> At first I thought that this was a zabbix-only problem, but after a few
> dozen alerts in the middle of the night with no load on the system I
> began to suspect that something else was going awry.
> During an event where the pacemaker and drbdadm checks time out, I was
> unable to log into the systems in a timely manner.
> I attempted to log into the mysql server to see what might be causing
> this blocking during an alerting event, but it takes 2-5 mins to log
> into the server, which seems off for a server with LoadAvg in the
> 0.0[1-9] range and iostat -dx showing the disks not over capacity. (I
> checked as soon as I was able to log in.)
> 
> I turned on sar on the server to get better data and found 2 other
> things occurring at exactly the same time: a spike in totsck and one of
> the cores having high cpu utilization. Normally totsck is in the 500
> range, but during an event it was in the 1500 range.

So this is a mysql database and applications connect to it ... a lot of
those tcp connections are in TIME_WAIT state; have you checked whether
they are all mysql connections? Have you been able to open a remote
mysql connection and execute a SHOW PROCESSLIST?
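
A rough way to check this on the server itself, sketched in Python. The
assumptions here are mine and not from this thread: the host is Linux,
mysqld listens on the default port 3306, and only IPv4 sockets from
/proc/net/tcp are considered (IPv6 sockets are in /proc/net/tcp6). The
helper names are made up for illustration.

```python
from collections import Counter

# State codes used in /proc/net/tcp (subset; unknown codes are kept as hex).
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "08": "CLOSE_WAIT", "0A": "LISTEN"}
MYSQL_PORT = 3306  # assumption: default MySQL port


def parse_proc_net_tcp(text):
    """Yield (local_port, remote_port, state) for each socket line."""
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        # Addresses look like "0100007F:0CEA"; the port is hex after the colon.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        state = TCP_STATES.get(fields[3].upper(), fields[3])
        yield local_port, remote_port, state


def summarize(text, port=MYSQL_PORT):
    """Return (per-state counts, number of TIME_WAIT sockets touching `port`)."""
    states = Counter()
    tw_on_port = 0
    for local_port, remote_port, state in parse_proc_net_tcp(text):
        states[state] += 1
        if state == "TIME_WAIT" and port in (local_port, remote_port):
            tw_on_port += 1
    return states, tw_on_port
```

Calling summarize(open("/proc/net/tcp").read()) during an alert and again
during a quiet period would show whether the extra sockets are TIME_WAIT
sockets on the mysql port or something else entirely.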

Have you tried an ssh connection with debug output? ... ssh -vvv shows
more information about where the login stalls.

Is DNS resolution working fine? sshd and MySQL do reverse DNS lookups
by default ...
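
To rule reverse DNS in or out, the lookup can be timed directly. This is
only a sketch: timed_reverse_lookup is a hypothetical helper, and
192.0.2.10 in the usage line is a documentation placeholder that should
be replaced with the address of a client that connects to the server
(for example the zabbix host).

```python
import socket
import time


def timed_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, elapsed seconds) for a reverse lookup of `ip`."""
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        hostname = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses:
        # no PTR record, or the resolver itself failed or timed out.
        hostname = None
    return hostname, time.monotonic() - start
```

For example, name, secs = timed_reverse_lookup("192.0.2.10"), run on the
mysql server. A lookup that takes seconds rather than milliseconds would
point at the resolver rather than at DRBD or pacemaker.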

Regards,
Andreas

-- 
Need help with DRBD?
http://www.hastexo.com/now

> 
>                totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
> 10:45:01 AM      1561       293        18         0         0       838
>
> 10:35:01 AM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest     %idle
> 10:45:01 AM       5     11.64      0.00     43.87      0.02      0.00      0.00      0.04      0.00     44.42
>
> 06:45:01 AM    totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
> 03:05:06 AM      1562       286        17         0         0       859
> 03:15:01 AM      1548       286        17         0         0       869
>
> 10:35:01 AM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest     %idle
> 03:15:01 AM       6     20.88      0.00     79.09      0.00      0.00      0.00      0.03      0.00      0.00
>
> 
> It is clear that something is happening on the server during these
> events. The following entries always appear in syslog at the same time
> (although the same entries also appear at times when the zabbix alerts
> and the inability to log in do not occur).
> 
> 
> 
> Jul  8 03:05:44 mysql-1 lrmd: [2834]: info: operation monitor[191] on
> ip1 for client 2837: pid 7573 exited with return code 0
> 
> Jul  8 03:08:00 mysql-1 crmd: [2837]: info: crm_timer_popped: PEngine
> Recheck Timer (I_PE_CALC) just popped (900000ms)
> 
> Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> 
> Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition:
> Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> 
> Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> 
> Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke: Query 867:
> Requesting the current CIB: S_POLICY_ENGINE
> 
> Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke_callback:
> Invoking the PE: query=867, ref=pe_calc-dc-1341731280-1029, seq=32,
> quorate=1
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_config: On loss
> of CCM Quorum: Ignore
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op:
> Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-2
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op:
> Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-1
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave  
> fs_mysql#011(Started mysql-1)
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave  
> fs_binlog#011(Started mysql-1)
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave  
> ip1#011(Started mysql-1)
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave  
> mysql#011(Started mysql-1)
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave  
> ip1arp#011(Started mysql-1)
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave  
> drbd_binlog:0#011(Slave mysql-2)
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave  
> drbd_binlog:1#011(Master mysql-1)
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave  
> drbd_mysql:0#011(Slave mysql-2)
> 
> Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave  
> drbd_mysql:1#011(Master mysql-1)
> 
> 
> I know the events are only "notice" and "info", but they always occur
> when zabbix alerts and we are unable to log in. I know that the
> aforementioned zabbix checks ssh into the mysql server to run the
> checks, so their timeouts are related to my timeout issue, where at the
> same time I cannot log in via ssh to the server.
> 
> The thing that is very suspect in all of this is that we never had a
> problem until we did a test failover. It was only after that failover
> that we started seeing issues.
> If anyone else has seen something similar I would be grateful for some
> insight.
> 
