[DRBD-user] Synchronization delayed

Lars Ellenberg lars.ellenberg at linbit.com
Wed Jul 16 18:48:02 CEST 2014

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Tue, Jul 15, 2014 at 06:43:59PM +0200, Antonio Fernández Pérez wrote:
> Hi list,
> 
> I have decided to post here because I have some doubts about my DRBD status.
> 
> I have detected some IO problems on my dedicated server, which runs
> MySQL on a DRBD cluster with Corosync+Pacemaker.
> Checking the system status, I have seen that iostat shows the drbd
> device as overloaded.

Well, no it does not.
The %util column is pretty much useless.
Especially with "virtual" devices.

(And DRBD's use of the kernel's disk-stats update primitives
has not always been correct either.)

What this says is simply that, most of the time, a new request comes in
while a previous request is still pending:
submission latency is much lower than completion latency,
and you have some parallelism.
That is all you can read from "100% utilization".
It is no indication of "overload" at all.
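
To make that concrete: on Linux, %util is derived from the per-device
"time spent doing I/Os" counter, which only advances while at least one
request is in flight. A rough sketch of the calculation (the field index
assumes the classic /proc/diskstats layout; newer kernels append more
fields after it, which does not matter here):

    #!/usr/bin/env python3
    # Rough sketch: derive "%util" the way iostat does, from /proc/diskstats.
    # The counter only grows while at least one request is outstanding, so
    # 100% means "never completely idle", not "saturated".
    import sys
    import time

    def io_ticks_ms(dev):
        # fields: major minor name reads ... ; index 12 is
        # "milliseconds spent doing I/Os" (io_ticks)
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return int(fields[12])
        raise ValueError("device %r not found" % dev)

    def util_percent(dev, interval=1.0):
        t0, busy0 = time.monotonic(), io_ticks_ms(dev)
        time.sleep(interval)
        t1, busy1 = time.monotonic(), io_ticks_ms(dev)
        return 100.0 * (busy1 - busy0) / ((t1 - t0) * 1000.0)

    if __name__ == "__main__":
        dev = sys.argv[1] if len(sys.argv) > 1 else "drbd0"
        print("%s: ~%.1f%% util" % (dev, util_percent(dev)))

One slow request per sampling interval is enough to push that number
towards 100%, which is exactly why it tells you nothing about how much
more the device could take.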


> Then, I checked the DRBD status by reading the logs. The following line appears in the corosync.log file:
> 
> Jul 15 17:50:15 llwh747-y lrmd: [1390]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 560 ms (> 100 ms) before being called (GSource: 0xa69e10)
>
> I'm not sure whether this warning is related to hard disk IO problems or whether it could develop into a bigger problem.

It is very unlikely that this would be related to load on DRBD at all.

It may simply be an indication that lrmd had "something else" to do
between detecting the signal and acting on it.
A blocking send on some socket, maybe. Or whatever.

It could possibly be resolved by upgrading your cluster stack software,
or by giving it more resources (CPU).

If that's the only warning you get, though,
and it stays in the sub-second range,
you can probably safely ignore it.
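
If you do want to keep an eye on it, something along these lines will
pull the delay values out of the log so you can see whether they ever
leave the sub-second range. The log path below is just a guess, point
it at wherever your corosync.log actually lives; the message wording is
taken from the line you quoted:

    #!/usr/bin/env python3
    # Rough helper: collect the "was delayed N ms" values from a
    # Pacemaker/corosync log. The default path is an assumption;
    # pass your actual corosync.log as the first argument.
    import re
    import sys

    PATTERN = re.compile(r"was delayed (\d+) ms")

    def delays(path):
        values = []
        with open(path, errors="replace") as f:
            for line in f:
                m = PATTERN.search(line)
                if m:
                    values.append(int(m.group(1)))
        return values

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cluster/corosync.log"
        values = delays(path)
        if values:
            print("%d delayed dispatches, max %d ms" % (len(values), max(values)))
        else:
            print("no delayed-dispatch warnings found")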

> Let me put iostat output.
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdb               1.60    12.60    0.80   13.60    19.20   206.80    15.69     0.04    2.89   1.28   1.84
> dm-4              0.00     0.00    2.40   26.20    19.20   206.80     7.90     0.09    3.22   0.67   1.92
> drbd0             0.00     0.00    2.40   25.80    19.20   206.40     8.00     8.81    3.83  35.46 100.00

As I said: not very exciting.
It only shows that your DRBD setup has some significant latency,
for whatever reason.
In fact that device is mostly idle.
I mean, that sample shows a few dozen requests per second.
I assure you that DRBD can handle a factor of several thousand more.
If you threw enough IO at it,
it would likely take *much* more (while still showing "100% utilization").
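
For what it's worth, the scary-looking svctm on that drbd0 line is just
arithmetic fallout from the other columns (newer sysstat versions
deprecate that field as unreliable, if I remember correctly). A quick
sanity check, with the values copied from your output:

    #!/usr/bin/env python3
    # Back-of-the-envelope check on the drbd0 line above, values copied
    # from the quoted iostat output.
    r_per_s  = 2.40     # read requests per second
    w_per_s  = 25.80    # write requests per second
    util_pct = 100.0    # reported "utilization"

    iops = r_per_s + w_per_s                     # ~28 requests per second
    busy_ms_per_s = util_pct / 100.0 * 1000.0    # "non-idle" ms per second

    # iostat's svctm is essentially busy time divided by request count,
    # so at 100% util and tiny IOPS it balloons -- it is not a real
    # per-request service time.
    svctm_ms = busy_ms_per_s / iops

    print("IOPS:          %.1f" % iops)
    print("derived svctm: %.2f ms  (compare the 35.46 in the output)" % svctm_ms)

A couple of dozen requests per second is nothing for DRBD; the numbers
only look dramatic because of how those columns are derived.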


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


