Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
On 07/09/2012 01:17 AM, Richard Goetz wrote:
> Hi DRBD Folks
> I have a strange issue where zabbix checks for drbdadm/pacemaker
> are alerting at random intervals. This all started after doing a
> test failover of the master node using drbd.
>
> Some of the checks that fail are executed by zabbix:
>
> COMMAND=/sbin/drbdadm dstate harddisk
> COMMAND=/sbin/drbdadm cstate ssddisk
> COMMAND= /usr/sbin/crm_mon -s
> COMMAND= /usr/sbin/crm_mon -1
>
> At first I thought that this was a zabbix-only problem, but then I
> began to suspect something was going awry. After a few dozen alerts
> in the middle of the night with no load on the system, I began to
> suspect that this was something else.
> During an event where the pacemaker/drbdadm checks time out, I was
> unable to log into the systems in a timely manner.
> I have attempted to log into the mysql server to see what may be
> causing this blocking during an alerting event, but I noticed that it
> takes 2-5 mins to log into the server, which seemed off for a server
> with LoadAvg in the 0.0[1-9] range and iostat -dx not showing it over
> capacity (I checked as soon as I was able to log in).
>
> I turned on sar on the server to get better data and found 2 other
> things occurring at exactly the same time: a spike in totsck and one
> of the cores having high cpu utilization. Normally totsck was in the
> 500 range, but during the event it was in the 1500 range.

So this is a mysql database and applications connect to it ... have you
checked if all those tcp connections, and a lot of them are in TIME_WAIT
state, are mysql connections?

Have you been able to do a remote mysql connection and execute a SHOW
PROCESSLIST?

Have you tried to do an ssh connection with debug output? ... ssh -vvv to
see more information

Is DNS resolution working fine? sshd and MySQL do reverse DNS lookups by
default ...

Regards,
Andreas

-- 
Need help with DRBD?
http://www.hastexo.com/now

> >              totsck  tcpsck  udpsck  rawsck ip-frag  tcp-tw
> > 10:45:01 AM    1561     293      18       0       0     838
> >
> > 10:35:01 AM  CPU     %usr    %nice     %sys  %iowait   %steal     %irq    %soft   %guest    %idle
> > 10:45:01 AM    5    11.64     0.00    43.87     0.02     0.00     0.00     0.04     0.00    44.42
> >
> > 06:45:01 AM  totsck  tcpsck  udpsck  rawsck ip-frag  tcp-tw
> > 03:05:06 AM    1562     286      17       0       0     859
> > 03:15:01 AM    1548     286      17       0       0     869
> >
> > 10:35:01 AM  CPU     %usr    %nice     %sys  %iowait   %steal     %irq    %soft   %guest    %idle
> > 03:15:01 AM    6    20.88     0.00    79.09     0.00     0.00     0.00     0.03     0.00     0.00
>
> It is clear that something is occurring on the server when this
> happens. The following events also always appear in syslog at the same
> time (although the same events also occur when the zabbix check
> failures/inability to log in do not appear to occur):
>
> > Jul 8 03:05:44 mysql-1 lrmd: [2834]: info: operation monitor[191] on ip1 for client 2837: pid 7573 exited with return code 0
> > Jul 8 03:08:00 mysql-1 crmd: [2837]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
> > Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> > Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke: Query 867: Requesting the current CIB: S_POLICY_ENGINE
> > Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke_callback: Invoking the PE: query=867, ref=pe_calc-dc-1341731280-1029, seq=32, quorate=1
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-2
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-1
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_mysql#011(Started mysql-1)
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_binlog#011(Started mysql-1)
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1#011(Started mysql-1)
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave mysql#011(Started mysql-1)
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1arp#011(Started mysql-1)
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:0#011(Slave mysql-2)
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:1#011(Master mysql-1)
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:0#011(Slave mysql-2)
> > Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:1#011(Master mysql-1)
>
> I know these events are only "notice" and "info", but they always occur
> when zabbix alerts/we are unable to log in. I know that the
> aforementioned zabbix checks try to ssh into mysql and run the checks,
> so their timeouts are related to my timeout issues, where at the same
> time I cannot log in via ssh to the server.
>
> The thing that is very suspect in all of this is that we never had a
> problem until we did a test failover. It was only after that failover
> that we started seeing issues.
> If anyone else has seen something similar I would be grateful for some
> insight.
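
Two of the checks suggested in the reply (are the TIME_WAIT sockets mysql connections, and is reverse DNS slow?) can be scripted so they run at the moment an alert fires. What follows is only a rough Python sketch, not something from the thread: it assumes a Linux host where /proc/net/tcp is readable and that mysql listens on its default port 3306, which the thread does not state. The helper names count_states and time_reverse_lookup are made up for illustration.

```python
import socket
import time
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp (Linux).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(lines, port=3306):
    """Count TCP states for sockets whose local or remote port is `port`.

    `lines` is an iterable of lines in /proc/net/tcp format.
    The default port 3306 is an assumption (MySQL's default).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything too short to be a socket entry.
        if len(fields) < 4 or fields[0] == "sl":
            continue
        # Addresses look like "0100007F:0CEA": hex IP, colon, hex port.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        if port in (local_port, remote_port):
            counts[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return counts


def time_reverse_lookup(ip):
    """Return the seconds spent on a reverse DNS lookup of `ip`."""
    start = time.monotonic()
    try:
        socket.gethostbyaddr(ip)
    except OSError:
        pass  # a failed lookup still shows how long the resolver took
    return time.monotonic() - start
```

On the affected host this could be used as count_states(open("/proc/net/tcp")) during an event, and time_reverse_lookup() with the address of the machine the zabbix checks come from. A lookup that takes many seconds would point at DNS rather than at DRBD or pacemaker. IPv6 sockets live in /proc/net/tcp6, which uses the same column layout.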