[DRBD-user] recovery from "page allocation failure"

Lars Ellenberg lars.ellenberg at linbit.com
Thu Jul 18 10:53:12 CEST 2013


On Thu, Jul 18, 2013 at 10:11:15AM +0900, Christian Balzer wrote:
> On Wed, 17 Jul 2013 11:27:23 +0200 Lars Ellenberg wrote:
> 
> > On Wed, Jul 17, 2013 at 05:25:13PM +0900, Christian Balzer wrote:
> > > 
> > > 
> > > On a very busy cluster with kernel 3.4.48 and DRBD 8.4.3 I was able to
> > > reduce these kernel messages from dozens a day to nearly none by
> > > setting
> > > 
> > > vm/min_free_kbytes = 524288
> > 
> > Yes, that's a setting that should typically help.
> > 
> > > Lars, as this keeps popping up and always suggests DRBD to be the
> > > guilty party even if it's not, I wonder if you guys should have some
> > > back channel talk with the relevant people on the kernel ML...
> > 
> > I don't think that would lead anywhere;
> > the upstream kernel has gained "memory compaction" in the meantime,
> > so it should have become much less likely to hit this situation.
> > 
> Upstream kernel being from what version on?
> Note that even with the above setting the last time it failed to get an
> order 5 alloc (128MB) things looked like this:

128 *k*, of course.

> ---
> Node 0 Normal: 66757*4kB 692*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 276660kB
> Node 1 Normal: 25424*4kB 21411*8kB 5130*16kB 1037*32kB 96*64kB 26*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 402072kB
> ---
> 
> So on the node where it counted, lots of tiny fragments (and just about
> half the free memory to boot).
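
To make the arithmetic concrete: an order-n allocation needs 2^n physically
contiguous pages, so order 5 with 4 kB pages is 32 * 4 kB = 128 kB. Below is a
rough Python sketch that parses free lists like the two above and computes
which fraction of the free memory sits in blocks too small for a given order.
It is not taken from DRBD or the kernel: the helper names are made up for
illustration, and 4 kB base pages are an assumption.

```python
import re

PAGE_KB = 4  # assumption: 4 kB base pages, as in the free lists quoted above


def order_size_kb(order):
    """Size in kB of one block of the given order (2**order pages)."""
    return PAGE_KB << order


def parse_free_list(line):
    """Return {block_size_kB: count} from a 'Node N Zone: 66757*4kB ...' line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def unusable_fraction(free_list, order):
    """Fraction of free memory held in blocks smaller than the given order."""
    total = sum(size * count for size, count in free_list.items())
    need = order_size_kb(order)
    usable = sum(size * count
                 for size, count in free_list.items() if size >= need)
    return (total - usable) / total if total else 0.0


node0 = ("Node 0 Normal: 66757*4kB 692*8kB 0*16kB 0*32kB 0*64kB 0*128kB "
         "0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 276660kB")
node1 = ("Node 1 Normal: 25424*4kB 21411*8kB 5130*16kB 1037*32kB 96*64kB "
         "26*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 402072kB")

if __name__ == "__main__":
    for line in (node0, node1):
        free = parse_free_list(line)
        print(line.split(":")[0], round(unusable_fraction(free, 5), 3))
```

For the two lists above this gives about 0.985 for Node 0 and 0.981 for
Node 1 at order 5, so nearly all free memory is in fragments smaller than
128 kB. As far as I understand its definition, this is the same ratio that
extfrag/unusable_index reports.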
> 
> > Part of the issue was that there is no "physically contiguous" memory
> > available: even though we have free memory, it is too fragmented.
> > 
> > The "compaction" should cause "defragmentation" during normal
> > allocations, making it much less likely to fail atomic allocations due
> > to fragmentation.
> > 
> If that is supposed to deliver the same results as a forced compaction, I
> don't see it work, as the 3.2 tests I posted in the thread last year and
> current ones with 3.4 suggest:
> ---
> # cat /sys/kernel/debug/extfrag/unusable_index 
> Node 0, zone      DMA 0.000 0.000 0.000 0.000 0.000 0.008 0.016 0.032 0.032 0.097 0.226 
> Node 0, zone    DMA32 0.000 0.161 0.212 0.256 0.283 0.329 0.514 0.744 0.939 1.000 1.000 
> Node 0, zone   Normal 0.000 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 
> Node 1, zone   Normal 0.000 0.297 0.783 0.889 0.935 0.963 0.985 0.989 0.989 0.989 0.989 
> # echo 1 > /proc/sys/vm/compact_memory
> # cat /sys/kernel/debug/extfrag/unusable_index 
> Node 0, zone      DMA 0.000 0.000 0.000 0.000 0.000 0.008 0.016 0.032 0.032 0.097 0.226 
> Node 0, zone    DMA32 0.000 0.032 0.055 0.092 0.189 0.324 0.516 0.751 0.940 0.985 1.000 
> Node 0, zone   Normal 0.000 0.984 0.984 0.984 0.984 0.984 0.984 0.984 0.984 0.984 0.984 
> Node 1, zone   Normal 0.000 0.304 0.798 0.902 0.940 0.964 0.985 0.989 0.989 0.989 0.989 
> ---
> Not any real improvement where it counts.
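
To compare such snapshots without eyeballing columns, something like the
following could be used. This is a minimal sketch, assuming each row lists one
value per order starting at order 0 (so index 5 is order 5). The function
names are hypothetical, and the sample rows are copied from the output above.

```python
def parse_unusable_index(text):
    """Map (node, zone) -> list of per-order unusable index values."""
    result = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # tolerate mail quoting
        if not line.startswith("Node"):
            continue
        head, values = line.split("zone", 1)
        node = int(head.replace("Node", "").strip(" ,"))
        fields = values.split()
        result[(node, fields[0])] = [float(v) for v in fields[1:]]
    return result


def delta_at_order(before, after, order):
    """Change of the index at one order; positive means more fragmented."""
    return {key: round(after[key][order] - before[key][order], 3)
            for key in before if key in after}


BEFORE = """\
Node 0, zone    DMA32 0.000 0.161 0.212 0.256 0.283 0.329 0.514 0.744 0.939 1.000 1.000
Node 0, zone   Normal 0.000 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983
Node 1, zone   Normal 0.000 0.297 0.783 0.889 0.935 0.963 0.985 0.989 0.989 0.989 0.989
"""
AFTER = """\
Node 0, zone    DMA32 0.000 0.032 0.055 0.092 0.189 0.324 0.516 0.751 0.940 0.985 1.000
Node 0, zone   Normal 0.000 0.984 0.984 0.984 0.984 0.984 0.984 0.984 0.984 0.984 0.984
Node 1, zone   Normal 0.000 0.304 0.798 0.902 0.940 0.964 0.985 0.989 0.989 0.989 0.989
"""

if __name__ == "__main__":
    changes = delta_at_order(parse_unusable_index(BEFORE),
                             parse_unusable_index(AFTER), order=5)
    for (node, zone), delta in sorted(changes.items()):
        print(f"Node {node} {zone}: {delta:+.3f}")
```

Values closer to 1.0 mean less of the free memory is usable at that order.
For the rows above, both Normal zones move by +0.001 at order 5 after the
forced compaction, which matches the "no real improvement" reading.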

Post that to the mm lists.
You want to complain to the right people.

> > "just use a more recent kernel" should help as well, already

I was just saying that the situation should have improved (compared with
kernels that don't even know about compaction),
and likely will keep improving (free memory fragmentation does affect
other things and performance in general). I didn't say it was "fixed".

Still, this "compaction" is an interesting problem:
not all pages can be "migrated" freely, for various reasons.

> When choosing a kernel for a new system/cluster I try to pick the latest
> "longterm" one that works with the userspace tools of the current distro
> release I'm using. 
> And unless something absolutely requires me to do otherwise, these
> machines will stay up as long as possible, in some cases until their
> replacement (5 years).

And for exactly that reason there are a lot of people using
"age old" kernels *today*; and those occasionally need the hint
that the idea of upgrading sometimes is at least worth considering.

> So that's 3.4 at this time, if somebody convinces me that 3.10 will turn
> longterm I could give that a try and see if it plays nice with Wheezy
> userland tools when building the next cluster.
> 
> Regards,
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
