Hi DRBD Folks,<div>I have a strange issue where the Zabbix checks for drbdadm/Pacemaker time out and alert at random intervals. This all started after we did a test failover of the master node using DRBD.</div><div><br></div>
<div>Some of the checks that fail are commands executed by Zabbix:</div><div><br></div><div>COMMAND=/sbin/drbdadm dstate harddisk</div><div>COMMAND=/sbin/drbdadm cstate ssddisk</div><div>COMMAND=/usr/sbin/crm_mon -s</div><div>
COMMAND=/usr/sbin/crm_mon -1</div><div><br></div><div>At first I thought this was a Zabbix-only problem, but after a few dozen alerts in the middle of the night with no load on the system, I began to suspect that something else was going wrong.</div>
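<div>A minimal sketch of how the commands listed above could be timed locally on the node, so that slow checks can be told apart from slow ssh logins. It assumes Python 3 is available on the server; the helper name <code>time_check</code> and the 10-second timeout are hypothetical and are not part of Zabbix, DRBD, or Pacemaker.</div>

```python
import subprocess
import time


def time_check(cmd, timeout=10.0):
    """Run cmd and return (elapsed_seconds, outcome).

    outcome is "exit N" for a finished command, "timeout" if it did not
    finish within `timeout` seconds, or "error: ..." if it could not start.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=timeout)
        outcome = "exit %d" % proc.returncode
    except subprocess.TimeoutExpired:
        outcome = "timeout"
    except OSError as exc:
        # e.g. the binary is not installed on this host
        outcome = "error: %s" % exc.strerror
    return time.monotonic() - start, outcome


# The four checks quoted in this post (paths as they appear in the log).
CHECKS = [
    ["/sbin/drbdadm", "dstate", "harddisk"],
    ["/sbin/drbdadm", "cstate", "ssddisk"],
    ["/usr/sbin/crm_mon", "-s"],
    ["/usr/sbin/crm_mon", "-1"],
]

if __name__ == "__main__":
    for cmd in CHECKS:
        elapsed, outcome = time_check(cmd)
        print("%s %-40s %6.2fs %s" % (time.strftime("%H:%M:%S"),
                                      " ".join(cmd), elapsed, outcome))
```

<div>Run from cron every minute or so, the output could be lined up against the alert times: if the commands themselves return quickly while Zabbix still times out, the delay is more likely in the login path than in drbdadm or crm_mon.</div>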
<div>During an event where the Pacemaker and drbdadm checks time out, I was unable to log in to the systems in a timely manner.</div><div>I attempted to log in to the MySQL server during an alerting event to see what might be causing the blocking, but logging in took 2-5 minutes, which seemed off for a server with a load average in the 0.0[1-9] range and with iostat -dx showing that the disks were not over capacity (I checked as soon as I was able to log in).</div>
<div><br></div><div>I turned on sar on the server to get better data and found two other things occurring at exactly the same time: a spike in totsck and high CPU utilization on one of the cores. Normally totsck is in the 500 range, but during an event it is in the 1500 range.</div>
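<div>A rough sketch of how such spikes could be picked out of saved <code>sar -n SOCK</code> output, assuming the 12-hour timestamp layout shown in this post. The function names and the threshold of 1000 (somewhere between the normal 500 range and the 1500 range seen during events) are assumptions made for illustration.</div>

```python
FIELDS = ("totsck", "tcpsck", "udpsck", "rawsck", "ip-frag", "tcp-tw")


def parse_sock_line(line):
    """Parse one data row such as '03:15:01 AM 1548 286 17 0 0 869'.

    Returns a dict of the socket counters plus 'time', or None for
    header rows, CPU rows, and anything else that does not match.
    """
    parts = line.split()
    if len(parts) != 8 or parts[1] not in ("AM", "PM"):
        return None
    try:
        values = [int(p) for p in parts[2:]]
    except ValueError:
        return None  # header row: the columns are names, not numbers
    row = dict(zip(FIELDS, values))
    row["time"] = parts[0] + " " + parts[1]
    return row


def spikes(lines, threshold=1000):
    """Return the parsed rows whose totsck exceeds threshold."""
    rows = (parse_sock_line(line) for line in lines)
    return [r for r in rows if r is not None and r["totsck"] > threshold]
```

<div>Feeding it the lines of a sar report, for example <code>spikes(open("sock.txt"))</code> with a hypothetical file name, gives the timestamps to compare against the Zabbix alerts. The tcp-tw column is kept in each row so it can be looked at alongside totsck.</div>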
<div><br></div><div>
<p class="p1"> totsck tcpsck udpsck rawsck ip-frag tcp-tw</p>
<p class="p2"><br></p>
<p class="p1">10:45:01 AM 1561 293 18 0 0 838</p>
<p class="p1">10:35:01 AM CPU %usr %nice %sys %iowait %steal %irq %soft %guest %idle</p>
<p class="p1">10:45:01 AM 5 11.64 0.00 43.87 0.02 0.00 0.00 0.04 0.00 44.42</p><p class="p1"><br></p>
<p class="p1">06:45:01 AM totsck tcpsck udpsck rawsck ip-frag tcp-tw</p>
<p class="p1">03:05:06 AM 1562 286 17 0 0 859</p>
<p class="p1">03:15:01 AM 1548 286 17 0 0 869</p>
<p class="p1">10:35:01 AM CPU %usr %nice %sys %iowait %steal %irq %soft %guest %idle</p>
<p class="p1">03:15:01 AM 6 20.88 0.00 79.09 0.00 0.00 0.00 0.03 0.00 0.00</p><p class="p1"><br></p><p class="p1">It is clear that something is happening on the server during these events. The following entries also always appear in syslog at the same time (although the same entries also appear at times when the Zabbix check failures and login delays do not occur):</p>
<p class="p1"><br></p><p class="p1"></p><p class="p1"><br></p><p class="p1">Jul 8 03:05:44 mysql-1 lrmd: [2834]: info: operation monitor[191] on ip1 for client 2837: pid 7573 exited with return code 0</p><p class="p1">Jul 8 03:08:00 mysql-1 crmd: [2837]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)</p>
<p class="p1">Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]</p>
<p class="p1">Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED</p><p class="p1">Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.</p>
<p class="p1">Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke: Query 867: Requesting the current CIB: S_POLICY_ENGINE</p><p class="p1">Jul 8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke_callback: Invoking the PE: query=867, ref=pe_calc-dc-1341731280-1029, seq=32, quorate=1</p>
<p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_config: On loss of CCM Quorum: Ignore</p><p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-2</p>
<p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-1</p><p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_mysql#011(Started mysql-1)</p>
<p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave fs_binlog#011(Started mysql-1)</p><p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1#011(Started mysql-1)</p>
<p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave mysql#011(Started mysql-1)</p><p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave ip1arp#011(Started mysql-1)</p>
<p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:0#011(Slave mysql-2)</p><p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_binlog:1#011(Master mysql-1)</p>
<p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:0#011(Slave mysql-2)</p><p class="p1">Jul 8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave drbd_mysql:1#011(Master mysql-1)</p>
<div><br></div><div>I know these events are only "notice" and "info" level, but they always occur when Zabbix alerts and we are unable to log in. The Zabbix checks mentioned above ssh into the MySQL server to run the commands, so their timeouts are related to the login problem: at the same moments, I cannot log in to the server via ssh either.</div>
<div><br></div><div>What makes this especially suspect is that we never had a problem until we did the test failover. It was only after that failover that we started seeing issues.</div><div>If anyone else has seen something similar, I would be grateful for some insight.</div>
<div><br></div><div><br></div><p></p><p></p></div>