[DRBD-user] DRBD versus memory fragmentation

Tue May 16 07:01:26 CEST 2017

Hello,

As a further data point: identical HW, including the network stack
(Mellanox IB cards and IPoIB), with similar/identical kernels but running
Ceph OSDs isn't having any issues, despite running as long as or longer
than the affected DRBD nodes and with similar if not worse memory
fragmentation (though again with significantly fewer processes).

IMHO this exonerates the network part even more.
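
For anyone wanting to compare fragmentation between nodes in a comparable
way, here is a minimal sketch that condenses /proc/buddyinfo into one number
per zone. The function names and the choice of order 4 as the cut-off are my
own assumptions for illustration, not anything DRBD or the kernel prescribes;
it only assumes the usual buddyinfo layout where column N holds the count of
free blocks of 2**N pages.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block counts for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 ... cN"
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones


def high_order_share(counts, min_order=4):
    """Fraction of free pages sitting in blocks of at least 2**min_order pages.

    min_order=4 is an arbitrary illustrative threshold (hypothetical).
    A value close to 0 means free memory is mostly small fragments.
    """
    pages = [count << order for order, count in enumerate(counts)]
    total = sum(pages)
    return sum(pages[min_order:]) / total if total else 0.0
```

On a live node one would feed it open("/proc/buddyinfo").read() and print
high_order_share() for each zone; lower values mean fewer large contiguous
blocks are left.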

Christian

On Wed, 10 May 2017 21:43:15 +0900 Christian Balzer wrote:

> Hello,
> 
> On Wed, 10 May 2017 12:03:44 +0200 Robert Altnoeder wrote:
> 
> > On 05/10/2017 05:01 AM, Christian Balzer wrote:  
> > > ---
> > > [3526901.689492] block drbd0: Remote failed to finish a request within 60444ms > ko-count (10) * timeout (60 * 0.1s)
> > > [3526901.689516] drbd mb11: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) 
> > > ---    
> > 
> > [...]
> >   
> Make sure to read the previous bits and the now 4-year-old thread on this
> ML titled [recovery from "page allocation failure"].
> 
> > > The node which failed to respond in time had again pretty badly fragmented
> > > memory:    
> > With a sleep time of around 60 seconds I would tend to think that any
> > sudden continuation might be a side effect of running a compact-memory
> > task rather than being directly caused by the fact that the memory is
> > fragmented (because even if it is, it seems unlikely that any memory
> > management operation could take that long).
> >  
> Nothing (manual) was done to that system at this time.
> And with all the cleanup/defragmentation that's done automatically by the
> kernel I've never seen any spikes or lockups that would explain this.
> 
> Heck, manually dropping caches or compacting memory is CPU intensive,
> but also seems to be totally non-intrusive and non-disruptive.
>  
> And in the previous cases (where the problem persisted) a simple dropping
> of the pagecache was enough to untwist things.
> 
> > In that case, the problem might be caused by a bug in DRBD. The first
> > question would be whether it was the remote system that failed to
> > finish a request in time - as the error message claims - or whether the
> > local system was stuck and did not receive the remote system's
> > acknowledgement in time.
> > 
> > Is there anything to be found in the log of the remote system?
> >   
> Nothing indicative at all.
> CRMD and my monitoring detected a high (20 odd) load on the local system,
> but that was a result of DRBD being stuck, not the other way around.
> 
> As mentioned in the initial post, no strange (unexplainable) load spikes.
> And most importantly always a uni-directional failure of one resource,
> while the other one (being primary on the supposedly slow system) doesn't
> fail at the same time.
> Something being stuck for that long in the kernel or network stack should
> affect both resources, so my suspect is DRBD, but that's not conclusive of
> course.
> Though the network is pretty much off the hook AFAICT, not only because
> the other DRBD resource keeps working, but also because corosync has no
> issues over that link.
> 
> > > I simply can't believe or accept that manually dropping caches and
> > > compacting memory is required to run a stable DRBD cluster in this day
> > > and age.    
> > If the problem is actually related to cache and memory management, and
> > that is what prevents DRBD from running properly, then DRBD would almost
> > certainly be the wrong place to make an attempt to fix it.
> >   
> Correct, but how to narrow things down?
> 
> On my very old clusters with 3.2 to 3.4 custom kernels and DRBD 8.4.4 I see
> the long ago mentioned page allocation failures, but VERY infrequently
> since bumping up vm/min_free_kbytes to 512MB (1-2GB on the brand new
> clusters). They all have 32GB RAM.
> 
> On an intermediary cluster (same HW as the 2 new ones, but with "only"
> 64GB RAM) running a stock Wheezy install (thus kernel 3.16 and DRBD 8.4.3)
> I see neither allocation failures nor these timeouts. 
> OTOH it has far fewer processes running, but the fragmentation is pretty
> much the same.
> 
> The new clusters (with tens of thousands of idle IMAP processes, but they
> are at best contributing to the fragmentation) with either DRBD 8.4.3 or
> 8.4.7 and kernel 4.9 show this problem; they have 128 or 256GB RAM.
> 
> If it is a problem of DRBD allocating things in a timely fashion, how
> about DRBD not giving back RAM it has previously gotten (and thus needed)
> unless asked to?
> I'm thinking along the lines of Ceph OSD daemons, whose heap will also not
> shrink unless requested.
> Donating a few hundred MBs to DRBD as opposed to getting things stuck seems
> a fair enough deal to me.
> 
> > On a side note, considering this day and age, scheduling and memory
> > management in general purpose operating systems are an especially
> > frustrating subject matter. In the entire design philosophy of virtually
> > all such OSs, scheduling is done more or less randomly, with virtually
> > no guarantees at all as to if or when a certain task will continue or
> > complete. You notice the consequences every time you hear an audio
> > dropout because some thread thought that now is a good time to hog the
> > CPU some time longer than usual.
> >   
> Also correct, but then again people have worked their way around this.
> 
> As I stated before, I can pretty much rule out CPU contention as such,
> plenty of FAST cores, the systems are near idle at the time (load of 0.3
> or so) and the storage system (all Intel DC S SSDs) is bored as well.
> 
> The kernel being a dick in some capacity I can very well imagine of course.
> 
> Christian
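
As a footnote to the log line quoted above: the limit it reports is simply
ko-count multiplied by the timeout, with the timeout given in tenths of a
second. A small sketch of that arithmetic, using only the numbers from the
log (the function name is made up for illustration):

```python
def effective_timeout_ms(ko_count, timeout_tenths):
    """Limit in milliseconds: ko-count * timeout, timeout in units of 0.1s."""
    return ko_count * timeout_tenths * 100  # 0.1s = 100ms


# Values taken from the quoted log line.
limit_ms = effective_timeout_ms(ko_count=10, timeout_tenths=60)
elapsed_ms = 60444

# True means the peer exceeded the limit and gets declared dead.
exceeded = elapsed_ms > limit_ms
overshoot_ms = elapsed_ms - limit_ms
```

So the peer had a full 60 seconds to finish the request and was declared
dead less than half a second after that limit had passed.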

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Rakuten Communications
http://www.gol.com/