[DRBD-user] Page allocation failure (IPOIB, Infiniband, connected mode)

Christian Balzer chibi at gol.com
Wed Mar 21 04:48:55 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Tue, 20 Mar 2012 17:04:48 +0100 Lars Ellenberg wrote:

> On Mon, Mar 19, 2012 at 05:04:44PM +0900, Christian Balzer wrote:
> > 
> > Hi Florian,
> > 
> > On Fri, 16 Mar 2012 13:55:17 +0100 Florian Haas wrote:
> > 
> > > On Wed, Mar 14, 2012 at 7:48 AM, Christian Balzer <chibi at gol.com>
> > > wrote:
> > > > Hello,
> > > >
> > > > This is basically a repeat of:
> > > > http://lists.linbit.com/pipermail/drbd-user/2011-August/016758.html
> > > >
> > > > 32GB RAM, Debian Squeeze, 3.2 (debian backport) kernel, 8.3.12
> > > > DRBD, IPOIB in connected mode with a 64k MTU. Just 2 DRBD
> > > > resources.
> > > >
> > > > After encountering this for the first time (never showed up in two
> > > > weeks of stress testing, which only goes to prove that real life
> > > > just can't be simulated) I found the above article and changed the
> > > > following sysctls:
> > > >
> > > > vm/min_free_kbytes = 262144
> > [snip]
> > > >
> > > > Lars hinted at "atomic reserves" in his reply - which particular
> > > > parameters are we talking about here?
> > > 
> > > I had hoped for Lars to pitch in here, but I guess I'll give it a go
> > > instead. Note I'm certainly no kernel memory management expert, but
> > > I'm not aware of anything that would fit that description other than
> > > the vm.min_free_kbytes sysctl you've already mentioned.
> > > 
> > Yeah, that was my assumption, too. 
> 
> Well, no.  Or rather, "it depends".
> 
> The trace you posted contains tcp_sendmsg, so it comes from the send path.
> 
> In the *receive* path, the min_free_kbytes reserve actually makes a
> difference.  In the *send* path it typically does not, because we are
> not in "atomic" context, but may block/sleep, and thus this reserve
> should normally not be touched.
> 
OK, that makes sense - the receive path allocates in atomic context and may
dip into the reserve that min_free_kbytes maintains, while the send path can
sleep and reclaim memory instead.
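
For the record, the reserve was raised to the value quoted further up, i.e.:

# sysctl -w vm.min_free_kbytes=262144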

> Also, the problem is not insufficient free memory, but insufficient
> free memory of the desired "order". Put differently: the problem
> is memory fragmentation.
> 
> So you need to look into memory "defragmentation", which is better
> known as "memory compaction" in the linux kernel.
> 
> Relevant sysctls:
> compact_memory (trigger to do an ad-hoc compaction run)
> extfrag_threshold, probably a few more.
>
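For reference, the threshold is reachable as follows (assuming the standard
sysctl layout; the value written below is just an arbitrary example):

# sysctl vm.extfrag_threshold
# sysctl -w vm.extfrag_threshold=250

It defaults to 500, on a scale of 0 to 1000.
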
Well, I managed to trigger that allocation failure the best way I know how:
a "du -s" of the DRBD resource, which slowly grows the slab (inode and
dentry caches) and thus puts pressure on the VM system.
When starting out, about 30GB were in use and I
monitored /sys/kernel/debug/extfrag/extfrag_index, which looked pretty much
like this:

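# cat /sys/kernel/debug/extfrag/extfrag_index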
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.999 
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 1, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 

with the value for the highest order of DMA32 slowly shrinking to 0.996.
But I assume DMA32 isn't used in this case, and none of the other values
ever changed from -1.000 (which supposedly means allocations of that order
would still succeed, i.e. no fragmentation or shortage).
When the failure occurred the VM dropped 9GB of pagecache on the floor
(used memory down to 21GB) and obviously was able to satisfy its needs
after that.
So judging from the extfrag_index there is no fragmentation to speak of, or
at least changing the threshold won't do me any good, as apart from the top
DMA32 order none of the values ever rose above -1.000.

However, unusable_index paints a slightly different and grimmer picture:
# cat /sys/kernel/debug/extfrag/unusable_index 
Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.001 0.009 0.017 0.033 0.033 0.097 0.226 
Node 0, zone    DMA32 0.000 0.421 0.658 0.835 0.941 0.978 0.990 0.993 0.994 0.994 1.000 
Node 0, zone   Normal 0.000 0.039 0.115 0.231 0.377 0.548 0.707 0.845 0.940 0.979 0.997 
Node 1, zone   Normal 0.000 0.090 0.250 0.442 0.610 0.747 0.853 0.934 0.983 0.997 0.997 

The closer a value is to 1, the larger the fraction of free memory that is
unusable for an allocation of that order; Node 0's Normal zone, for example,
shows 0.997 at the top order, i.e. 99.7% of its free memory cannot satisfy
an order-10 allocation.
Issuing a compact_memory run doesn't change much for the top three orders.
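(The run in question being

# echo 1 > /proc/sys/vm/compact_memory

which triggers an ad-hoc compaction pass across all zones.)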

> Or you need to fix the drivers to not require higher order page
> allocation, but be ok with just some single pages scattered around.
> 
Hacking kernel code is definitely beyond me. ^o^
Also, from what I gleaned of a discussion on the infiniband ML about this
issue, the allocation is part of the standard TCP kernel code, most likely
triggered by the 64KB MTU.
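
For reference, the IPoIB mode and MTU can be checked like this (assuming
the interface is named ib0; adjust the name as needed):

# cat /sys/class/net/ib0/mode
# ip link show ib0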

So I guess I'm still stuck with considering these events "harmless".

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/


