On Thu, Jul 18, 2013 at 10:11:15AM +0900, Christian Balzer wrote:
> On Wed, 17 Jul 2013 11:27:23 +0200 Lars Ellenberg wrote:
>
> > On Wed, Jul 17, 2013 at 05:25:13PM +0900, Christian Balzer wrote:
> > >
> > > On a very busy cluster with kernel 3.4.48 and DRBD 8.4.3 I was able to
> > > reduce these kernel messages from dozens a day to nearly none by
> > > setting
> > >
> > > vm/min_free_kbytes = 524288
> >
> > Yes, that's a setting that should typically help.
> >
> > > Lars, as this keeps popping up and always suggests DRBD to be guilty
> > > party even if it's not, I wonder if you guys should have some back
> > > channel talk with the relevant people on the kernel ML...
> >
> > I don't think that would lead anywhere,
> > upstream kernel has the "memory compaction" meanwhile,
> > so it should have become much less likely to hit this situation.
> >
> Upstream kernel being from what version on?
> Note that even with the above setting the last time it failed to get an
> order 5 alloc (128MB) things looked like this:

128 *k*, of course.

> ---
> Node 0 Normal: 66757*4kB 692*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 276660kB
> Node 1 Normal: 25424*4kB 21411*8kB 5130*16kB 1037*32kB 96*64kB 26*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 402072kB
> ---
>
> So on the node where it counted, lots of tiny fragments (and just about
> half the free memory to boot).
>
> > Part of the issue was that there is no "physically contiguous" memory
> > available: even though we have free memory, it is too fragmented.
> >
> > The "compaction" should cause "defragmentation" during normal
> > allocations, making it much less likely to fail atomic allocations due
> > to fragmentation.
> >
> If that is supposed to deliver the same results as a forced compaction, I
> don't see it work, as the 3.2 tests I posted in the thread last year and
> current ones with 3.4 suggest:
> ---
> # cat /sys/kernel/debug/extfrag/unusable_index
> Node 0, zone    DMA 0.000 0.000 0.000 0.000 0.000 0.008 0.016 0.032 0.032 0.097 0.226
> Node 0, zone  DMA32 0.000 0.161 0.212 0.256 0.283 0.329 0.514 0.744 0.939 1.000 1.000
> Node 0, zone Normal 0.000 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983
> Node 1, zone Normal 0.000 0.297 0.783 0.889 0.935 0.963 0.985 0.989 0.989 0.989 0.989
> # echo 1 > /proc/sys/vm/compact_memory
> # cat /sys/kernel/debug/extfrag/unusable_index
> Node 0, zone    DMA 0.000 0.000 0.000 0.000 0.000 0.008 0.016 0.032 0.032 0.097 0.226
> Node 0, zone  DMA32 0.000 0.032 0.055 0.092 0.189 0.324 0.516 0.751 0.940 0.985 1.000
> Node 0, zone Normal 0.000 0.984 0.984 0.984 0.984 0.984 0.984 0.984 0.984 0.984 0.984
> Node 1, zone Normal 0.000 0.304 0.798 0.902 0.940 0.964 0.985 0.989 0.989 0.989 0.989
> ---
> Not any real improvement where it counts.

Post that to the mm lists.
You want to complain to the right people.

> > "just use a more recent kernel" should help as well, already

I was just saying that the situation should have improved
(compared with kernels that don't even know about compaction),
and likely will keep improving (free memory fragmentation does affect
other things and performance in general).

I didn't say it was "fixed".

Still this "compaction" is an interesting problem,
not all pages can be "migrated" freely for various reasons.

> When choosing a kernel for a new system/cluster I try to pick the latest
> "longterm" one that works with the userspace tools of the current distro
> release I'm using.
> And unless something absolutely requires me to do otherwise, these
> machines will stay up as long as possible, in some cases until their
> replacement (5 years).

And for exactly that reason there are a lot of people using "age old"
kernels *today*; and those occasionally need the hint that the idea of
upgrading sometimes is at least worth considering.

> So that's 3.4 at this time, if somebody convinces me that 3.10 will turn
> longterm I could give that a try and see if it plays nice with Wheezy
> userland tools when building the next cluster.
>
> Regards,
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com        Global OnLine Japan/Fusion Communications
> http://www.gol.com/

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
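
The arithmetic behind the "128 *k*" correction and the free-list dump quoted in this thread can be checked with a few lines of Python. This is only a sketch: it assumes 4 kB base pages (as the dump suggests), and the helper names parse_free_list and unusable_fraction are made up for illustration. The fraction computed here is meant to mirror what unusable_index reports, as far as I understand it: the share of free memory sitting in blocks too small for an allocation of the given order.

```python
import re

PAGE_KB = 4  # assumption: 4 kB base pages, matching the 4kB column in the dump


def parse_free_list(line):
    """Return {block_size_kB: count} from a 'COUNT*SIZEkB ...' free-list line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def unusable_fraction(free_list, order):
    """Share of free memory held in blocks smaller than 2**order pages."""
    need_kb = PAGE_KB << order  # order 5 -> 128 kB
    total = sum(size * count for size, count in free_list.items())
    usable = sum(size * count
                 for size, count in free_list.items() if size >= need_kb)
    return (total - usable) / total if total else 0.0


# The two lines from the dump quoted in the thread.
node0 = parse_free_list(
    "Node 0 Normal: 66757*4kB 692*8kB 0*16kB 0*32kB 0*64kB 0*128kB "
    "0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 276660kB")
node1 = parse_free_list(
    "Node 1 Normal: 25424*4kB 21411*8kB 5130*16kB 1037*32kB 96*64kB "
    "26*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 402072kB")

if __name__ == "__main__":
    print("order 5 block size in kB:", PAGE_KB << 5)
    for name, free_list in (("Node 0", node0), ("Node 1", node1)):
        print(name, "unusable at order 5:",
              round(unusable_fraction(free_list, 5), 3))
```

On Node 0 the only block that can satisfy an order 5 request is the single 4096kB one, which is why nearly all of the free memory there counts as unusable for that allocation.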