On 05/10/2017 05:01 AM, Christian Balzer wrote:
> ---
> [3526901.689492] block drbd0: Remote failed to finish a request within 60444ms > ko-count (10) * timeout (60 * 0.1s)
> [3526901.689516] drbd mb11: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
> ---
[...]
> The node which failed to respond in time had again pretty badly fragmented
> memory:

With a sleep time of around 60 seconds, I would tend to think that any
sudden continuation might be a side effect of running a compact-memory
task rather than being directly caused by the fact that the memory is
fragmented (because even if it is, it seems unlikely that any memory
management operation could take that long).
In that case, the problem might be caused by a bug in DRBD.

The first question would be whether it was the remote system that failed
to finish a request in time - as the error message claims - or whether
the local system was stuck and did not receive the remote system's
acknowledgement in time.
Is there anything to be found in the log of the remote system?

> I simply can't believe or accept that manually dropping caches and
> compacting memory is required to run a stable DRBD cluster in this day
> and age.

If the problem is actually related to cache and memory management, and
that is what prevents DRBD from running properly, then DRBD would almost
certainly be the wrong place to make an attempt to fix it.

On a side note, considering this day and age, scheduling and memory
management in general purpose operating systems are an especially
frustrating subject matter. In the entire design philosophy of virtually
all such OSs, scheduling is done more or less randomly, with virtually
no guarantees at all as to if or when a certain task will continue or
complete. You notice the consequences every time you hear an audio
dropout because some thread decided that now was a good time to hog the
CPU for longer than usual.
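
For reference, the deadline in the quoted log line can be checked with a few lines of Python. This is only a sketch of the arithmetic printed in the message itself (ko-count times timeout, with timeout given in units of 0.1 s); the function and variable names are made up for illustration and are not part of DRBD.

```python
def request_deadline_ms(ko_count: int, timeout_tenths: int) -> int:
    """Deadline implied by the log line: ko-count * timeout.

    timeout_tenths is in units of 0.1 s, as printed in the message,
    so one unit corresponds to 100 ms.
    (Hypothetical helper, not a DRBD API.)
    """
    return ko_count * timeout_tenths * 100


# Values taken from the quoted kernel log line.
deadline_ms = request_deadline_ms(ko_count=10, timeout_tenths=60)
elapsed_ms = 60444

# How far past the deadline the request was when the peer was dropped.
overshoot_ms = elapsed_ms - deadline_ms

print(deadline_ms, overshoot_ms)  # prints: 60000 444
```

So the connection was declared timed out less than half a second after the 60 second deadline had passed, which says nothing yet about which side was actually stuck during that minute.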

br,
-- 
Robert Altnoeder
+43 1 817 82 92 0
robert.altnoeder at linbit.com

LINBIT | Keeping The Digital World Running
DRBD - Corosync - Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.