Hi DRBD folks,<div>I have a strange issue where the Zabbix checks for drbdadm and Pacemaker fail and alert at random intervals. This all started after we did a test failover of the master node using DRBD.</div><div><br></div>

<div>Some of the failing checks that Zabbix executes are:</div><div><br></div><div>COMMAND=/sbin/drbdadm dstate harddisk </div><div>COMMAND=/sbin/drbdadm cstate ssddisk</div><div>COMMAND= /usr/sbin/crm_mon -s</div><div>

COMMAND= /usr/sbin/crm_mon -1</div><div><br></div><div>At first I thought this was a Zabbix-only problem, but after a few dozen alerts in the middle of the night with no load on the system, I began to suspect that something else was going awry.</div>
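<div><br></div><div>To separate a slow check from a slow ssh login, one option is to time the same commands locally on the node. The following is only a minimal sketch: the helper name time_command and the 30-second timeout are my own assumptions, and the command list simply repeats the four checks quoted above.</div>

```python
import subprocess
import time

# The four checks quoted above; resource names come from the post.
CHECKS = [
    ["/sbin/drbdadm", "dstate", "harddisk"],
    ["/sbin/drbdadm", "cstate", "ssddisk"],
    ["/usr/sbin/crm_mon", "-s"],
    ["/usr/sbin/crm_mon", "-1"],
]


def time_command(argv, timeout=30.0):
    """Run argv and return (elapsed_seconds, return_code).

    return_code is None if the command timed out or could not be started.
    The name and the default timeout are hypothetical, chosen for illustration.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = proc.returncode
    except (subprocess.TimeoutExpired, OSError):
        code = None
    return time.monotonic() - start, code


if __name__ == "__main__":
    for argv in CHECKS:
        elapsed, code = time_command(argv)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} {' '.join(argv)}: {elapsed:.2f}s rc={code}")
```

<div>Run from cron every minute or so, this would show whether the commands themselves stall during an event or whether only the remote login is slow.</div>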

<div>During an event where the Pacemaker and drbdadm checks time out, I am unable to log into the systems in a timely manner.</div><div>I attempted to log into the MySQL server during an alerting event to see what might be causing the blocking, but it took 2-5 minutes to log in, which seemed off for a server with a load average in the 0.0[1-9] range and with iostat -dx showing the disks were not over capacity (I checked as soon as I was able to log in).</div>

<div><br></div><div>I turned on sar on the server to get better data and found two other things occurring at exactly the same time: a spike in totsck, and one of the cores showing high CPU utilization. Normally totsck is in the 500 range, but during an event it is in the 1500 range.</div>
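<div><br></div><div>To pick these spikes out of a longer sar socket report, a small filter can help. This is a rough sketch only: the function names and the threshold of 1000 (chosen to sit between the normal 500 range and the 1500 range seen during events) are assumptions, and it expects rows in the 12-hour format that sar prints on this server.</div>

```python
def parse_sock_rows(lines):
    """Yield (timestamp, totsck) for each data row of a sar socket report.

    Data rows look like:
        HH:MM:SS AM|PM totsck tcpsck udpsck rawsck ip-frag tcp-tw
    Header rows and anything else that does not match are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) == 8 and fields[2].isdigit():
            yield f"{fields[0]} {fields[1]}", int(fields[2])


def find_spikes(lines, threshold=1000):
    """Return the rows whose totsck exceeds threshold (an assumed cutoff)."""
    return [(ts, n) for ts, n in parse_sock_rows(lines) if n > threshold]


if __name__ == "__main__":
    # Rows taken from the sar output in this post.
    sample = [
        "06:45:01 AM    totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw",
        "03:05:06 AM      1562       286        17         0         0       859",
        "03:15:01 AM      1548       286        17         0         0       869",
    ]
    for ts, n in find_spikes(sample):
        print(ts, n)
```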

<div><br></div><div>


<p class="p1">                          totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw</p>

<p class="p2"><br></p>

<p class="p1">10:45:01 AM      1561       293        18         0         0       838</p>

<p class="p1">10:35:01 AM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest     %idle</p>

<p class="p1">10:45:01 AM       5     11.64      0.00     43.87      0.02      0.00      0.00      0.04      0.00     44.42</p><p class="p1"><br></p><p class="p1">


</p><p class="p1">06:45:01 AM    totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw</p>

<p class="p1">03:05:06 AM      1562       286        17         0         0       859</p>

<p class="p1">03:15:01 AM      1548       286        17         0         0       869</p>

<p class="p1">10:35:01 AM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest     %idle</p>

<p class="p1">03:15:01 AM       6     20.88      0.00     79.09      0.00      0.00      0.00      0.03      0.00      0.00</p><p class="p1"><br></p><p class="p1">It is clear that something is happening on the server during these events. The following events also always appear in syslog at the same time (although the same events also occur at times when the Zabbix check failures and login problems do not).</p>

<p class="p1"><br></p><p class="p1"></p><p class="p1"><br></p><p class="p1">Jul  8 03:05:44 mysql-1 lrmd: [2834]: info: operation monitor[191] on ip1 for client 2837: pid 7573 exited with return code 0</p><p class="p1">Jul  8 03:08:00 mysql-1 crmd: [2837]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)</p>

<p class="p1">Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: State transition S_IDLE -&gt; S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]</p><p class="p1">

Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED</p><p class="p1">Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.</p>

<p class="p1">Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke: Query 867: Requesting the current CIB: S_POLICY_ENGINE</p><p class="p1">Jul  8 03:08:00 mysql-1 crmd: [2837]: info: do_pe_invoke_callback: Invoking the PE: query=867, ref=pe_calc-dc-1341731280-1029, seq=32, quorate=1</p>

<p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_config: On loss of CCM Quorum: Ignore</p><p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-2</p>

<p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on mysql-1</p><p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave   fs_mysql#011(Started mysql-1)</p>

<p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave   fs_binlog#011(Started mysql-1)</p><p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave   ip1#011(Started mysql-1)</p>

<p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave   mysql#011(Started mysql-1)</p><p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave   ip1arp#011(Started mysql-1)</p>

<p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave   drbd_binlog:0#011(Slave mysql-2)</p><p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave   drbd_binlog:1#011(Master mysql-1)</p>

<p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave   drbd_mysql:0#011(Slave mysql-2)</p><p class="p1">Jul  8 03:08:00 mysql-1 pengine: [1339]: notice: LogActions: Leave   drbd_mysql:1#011(Master mysql-1)</p>

<div><br></div><div>I know these events are only &quot;notice&quot; and &quot;info&quot;, but they always occur when Zabbix alerts and we are unable to log in. The Zabbix checks mentioned above ssh into the MySQL server to run the checks, so their timeouts are related to my own timeouts, where at the same time I cannot log into the server via ssh.</div>

<div><br></div><div>What is very suspect in all of this is that we never had a problem until we did a test failover. Only after that failover did we start seeing issues.</div><div>If anyone else has seen something similar, I would be grateful for some insight.</div>

</div>