Hi DRBD folks,

I have a strange issue where Zabbix checks for drbdadm/Pacemaker are alerting at random intervals. This all started after doing a test failover of the master node using DRBD.

Some of the failing checks executed by Zabbix are:

COMMAND=/sbin/drbdadm dstate harddisk
COMMAND=/sbin/drbdadm cstate ssddisk
COMMAND=/usr/sbin/crm_mon -s
COMMAND=/usr/sbin/crm_mon -1

At first I thought that this was a Zabbix-only problem, but then I began to suspect something was going awry. After a few dozen alerts in the middle of the night with no load on the system, I began to suspect that this was something else.

During an event where the Pacemaker/drbdadm checks time out, I was unable to log into the systems in a timely manner. I attempted to log into the MySQL server to see what might be causing the blocking during an alerting event, but I noticed that it was taking 2-5 minutes to log in, which seemed off for a server with LoadAvg in the 0.0[1-9] range and with iostat -dx showing it was not over capacity (I checked as soon as I was able to log in).

I turned on sar on the server to get better data and found two other things occurring at exactly the same time: a spike in totsck, and one of the cores having high CPU utilization. Normally totsck was in the 500 range, but during an event it was in the 1500 range.
             totsck  tcpsck  udpsck  rawsck  ip-frag  tcp-tw
10:45:01 AM    1561     293      18       0        0     838

10:35:01 AM  CPU   %usr  %nice   %sys  %iowait  %steal  %irq  %soft  %guest  %idle
10:45:01 AM    5  11.64   0.00  43.87     0.02    0.00  0.00   0.04    0.00  44.42

06:45:01 AM  totsck  tcpsck  udpsck  rawsck  ip-frag  tcp-tw
03:05:06 AM    1562     286      17       0        0     859
03:15:01 AM    1548     286      17       0        0     869

10:35:01 AM  CPU   %usr  %nice   %sys  %iowait  %steal  %irq  %soft  %guest  %idle
03:15:01 AM    6  20.88   0.00  79.09     0.00    0.00  0.00   0.03    0.00   0.00

It is clear that something is happening on the server when this occurs, and the following events always appear in syslog at the same time (although the same events also occur when the Zabbix check failures and the inability to log in do not appear):

Jul 8 03:05:44 mysql-1 lrmd: [2834]: info: operation monitor[191] on ip1 for client 2837: pid 7573 exited with return code 0
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke: Query 867: Requesting the current CIB: S_POLICY_ENGINE
Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke_callback: Invoking the PE: query=867, ref=pe_calc-dc-1341731280-1029, seq=32, quorate=1
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-2
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-1
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_mysql#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_binlog#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave mysql#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1arp#011(Started mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:0#011(Slave mysql-2)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:1#011(Master mysql-1)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:0#011(Slave mysql-2)
Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:1#011(Master mysql-1)

I know these events are only "notice" and "info", but they always occur when Zabbix alerts and we are unable to log in. I know that the aforementioned Zabbix checks SSH into mysql and run the checks, so their timeouts are related to my own timeout issues, where at the same time I cannot log in to the server via SSH.

The thing that is very suspect in all of this is that we never had a problem until we did a test failover. It was only after that failover that we started seeing issues.
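
Because the Zabbix checks go through SSH, a check timeout on its own does not show whether the login or the command is the slow part. One way to tell the two apart might be to time the same commands locally on the node, for example from cron. The following is only a minimal sketch, assuming Python 3 is available on the server; the function name timed_check and the 10-second timeout are illustrative assumptions, not anything Zabbix or Pacemaker provides.

```python
import subprocess
import time

# Commands as quoted in this post; adjust resource names to your setup.
CHECKS = [
    ["/sbin/drbdadm", "dstate", "harddisk"],
    ["/sbin/drbdadm", "cstate", "ssddisk"],
    ["/usr/sbin/crm_mon", "-s"],
    ["/usr/sbin/crm_mon", "-1"],
]


def timed_check(cmd, timeout=10.0):
    """Run cmd locally; report exit code, timeout status and elapsed seconds.

    The 10-second default timeout is an assumed value for illustration.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
        rc, timed_out = proc.returncode, False
    except subprocess.TimeoutExpired:
        # The command did not finish in time; subprocess.run kills it.
        rc, timed_out = None, True
    except OSError:
        # Binary missing or not executable on this host.
        rc, timed_out = None, False
    elapsed = time.monotonic() - start
    return {
        "cmd": " ".join(cmd),
        "rc": rc,
        "timed_out": timed_out,
        "elapsed": elapsed,
    }


if __name__ == "__main__":
    for cmd in CHECKS:
        result = timed_check(cmd)
        print(
            time.strftime("%Y-%m-%d %H:%M:%S"),
            result["cmd"],
            "rc=%s" % result["rc"],
            "timed_out=%s" % result["timed_out"],
            "elapsed=%.2fs" % result["elapsed"],
        )
```

If the local timings stay small while the Zabbix checks time out, that would point at the SSH login path rather than at drbdadm or crm_mon; if the local runs are slow as well, the timestamps can be compared against the sar samples and the syslog entries above.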
If anyone else has seen something similar, I would be grateful for some insight.