[DRBD-user] DRBD versus memory fragmentation

Robert Altnoeder robert.altnoeder at linbit.com
Wed May 10 12:03:44 CEST 2017


On 05/10/2017 05:01 AM, Christian Balzer wrote:
> ---
> [3526901.689492] block drbd0: Remote failed to finish a request within 60444ms > ko-count (10) * timeout (60 * 0.1s)
> [3526901.689516] drbd mb11: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) 
> ---


> The node which failed to respond in time had again pretty badly fragmented
> memory:
With a stall of around 60 seconds, I would tend to think that any
sudden continuation is a side effect of running a compact-memory task
rather than evidence that the stall was directly caused by fragmented
memory. Even if the memory is fragmented, it seems unlikely that any
memory management operation could take that long.
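
As a quick sanity check on the numbers in the quoted log line, here is a
minimal sketch that recomputes the threshold from the values the message
itself prints (ko-count 10, timeout 60 in units of 0.1s). It is not part
of DRBD; the function name and the regular expression are assumptions of
mine, written against this one log line only.

```python
import re

# The first log line from the quoted report, without the kernel timestamp.
LOG_LINE = (
    "block drbd0: Remote failed to finish a request within 60444ms "
    "> ko-count (10) * timeout (60 * 0.1s)"
)


def parse_timeout_line(line):
    """Return (elapsed_ms, threshold_ms) for a line of the format above.

    Hypothetical helper: the pattern only covers the exact wording seen
    in the quoted message.
    """
    m = re.search(
        r"within (\d+)ms > ko-count \((\d+)\) \* timeout \((\d+) \* 0\.1s\)",
        line,
    )
    if m is None:
        raise ValueError("line does not match the expected format")
    elapsed_ms, ko_count, timeout_tenths = (int(g) for g in m.groups())
    # timeout is printed in tenths of a second, so one unit is 100 ms
    threshold_ms = ko_count * timeout_tenths * 100
    return elapsed_ms, threshold_ms


elapsed_ms, threshold_ms = parse_timeout_line(LOG_LINE)
overshoot_ms = elapsed_ms - threshold_ms
print(f"threshold {threshold_ms} ms, elapsed {elapsed_ms} ms, "
      f"overshoot {overshoot_ms} ms")
```

So the threshold works out to 10 * 60 * 0.1s = 60 seconds, and the
message was logged shortly after that limit was crossed. The number
says when the timeout fired, not how long the peer would actually have
remained unresponsive.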

In that case, the problem might be caused by a bug in DRBD. The first
question would be whether it was the remote system that failed to
finish a request in time, as the error message claims, or whether the
local system was stuck and did not receive the remote system's
acknowledgement in time.

Is there anything to be found in the log of the remote system?

> I simply can't believe or accept that manually dropping caches and
> compacting memory is required to run a stable DRBD cluster in this day
> and age.
If the problem is actually related to cache and memory management, and
that is what prevents DRBD from running properly, then DRBD would almost
certainly be the wrong place to make an attempt to fix it.

On a side note, considering this day and age, scheduling and memory
management in general-purpose operating systems are an especially
frustrating subject. In the design philosophy of virtually all such
OSs, scheduling is done more or less randomly, with virtually no
guarantees at all as to if or when a certain task will continue or
complete. You notice the consequences every time you hear an audio
dropout because some thread decided that now was a good time to hog the
CPU for somewhat longer than usual.

Robert Altnoeder
+43 1 817 82 92 0
robert.altnoeder at linbit.com

LINBIT | Keeping The Digital World Running
DRBD - Corosync - Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
