Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Tue, 20 Mar 2012 17:04:48 +0100 Lars Ellenberg wrote:

> On Mon, Mar 19, 2012 at 05:04:44PM +0900, Christian Balzer wrote:
> > 
> > Hi Florian,
> > 
> > On Fri, 16 Mar 2012 13:55:17 +0100 Florian Haas wrote:
> > 
> > > On Wed, Mar 14, 2012 at 7:48 AM, Christian Balzer <chibi at gol.com>
> > > wrote:
> > > > Hello,
> > > > 
> > > > This is basically a repeat of:
> > > > http://lists.linbit.com/pipermail/drbd-user/2011-August/016758.html
> > > > 
> > > > 32GB RAM, Debian Squeeze, 3.2 (debian backport) kernel, 8.3.12
> > > > DRBD, IPOIB in connected mode with a 64k MTU. Just 2 DRBD
> > > > resources.
> > > > 
> > > > After encountering this for the first time (never showed up in two
> > > > weeks of stress testing, which only goes to prove that real life
> > > > just can't be simulated) I found the above article and changed the
> > > > following sysctls:
> > > > 
> > > > vm/min_free_kbytes = 262144
> > [snip]
> > > > 
> > > > Lars hinted at "atomic reserves" in his reply, which particular
> > > > parameters are we talking about here?
> > > I had hoped for Lars to pitch in here, but I guess I'll give it a go
> > > instead. Note I'm certainly no kernel memory management expert, but
> > > I'm not aware of anything that would fit that description other than
> > > the vm.min_free_kbytes sysctl you've already mentioned.
> > 
> > Yeah, that was my assumption, too.

> Well, no. Or rather, "it depends".
> 
> The trace you posted contains tcp_sendmsg, so this is from the send path.
> 
> In the *receive* path, min_free_kbytes actually makes a difference.
> In the *send* path, it typically does not, because we are not in
> "atomic" context, but may block/sleep, and thus this reserve should
> normally not be touched.
> 
OK, that makes sense.

> Also, the problem is not insufficient free memory, but insufficient
> free memory of the desired "order". Put differently: the problem
> is memory fragmentation.
> 
> So you need to look into memory "defragmentation", which is better
> known as "memory compaction" in the Linux kernel.
> 
> Relevant sysctls:
> compact_memory (trigger to do an ad-hoc compaction run)
> extfrag_threshold, and probably a few more.
> 
Well, I tried (and succeeded) to trigger that allocation failure the best
way I know how: a "du -s" of the DRBD resource, slowly growing the slab
(inode and dentry caches) and thus putting pressure on the VM system.

When starting out, about 30GB were in use and I monitored
/sys/kernel/debug/extfrag/extfrag_index, which looked pretty much like
this:

Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000  0.999 
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 1, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 

with the highest order of DMA32 slowly shrinking to 0.996. But I assume
that DMA32 isn't used in this case, and none of the other values ever
changed from -1 (which supposedly means no fragmentation or shortage).

When the failure occurred, the VM dropped 9GB of pagecache on the floor
(used memory down to 21GB) and obviously was able to satisfy its needs
after that.

So judging from the extfrag_index there is no fragmentation, or at least
changing the threshold won't do me any good, as none of the values ever
rose over 1.
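For reference, in case anyone wants to poke at this on their own box,
these are roughly the knobs involved (assuming debugfs is mounted at
/sys/kernel/debug and the kernel was built with compaction support):

  # watch -n 10 cat /sys/kernel/debug/extfrag/extfrag_index
  # cat /proc/sys/vm/extfrag_threshold
  # echo 1 > /proc/sys/vm/compact_memory

The last one triggers an ad-hoc compaction run over all zones;
extfrag_threshold (default 500, i.e. 0.5 on the scale of the index above,
if I read Documentation/sysctl/vm.txt correctly) is what the kernel
compares the fragmentation index against when deciding whether to compact
or to reclaim for a higher-order allocation.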
The unusable_index, however, paints a slightly different and grimmer
picture:

# cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone      DMA  0.000  0.000  0.000  0.001  0.001  0.009  0.017  0.033  0.033  0.097  0.226 
Node 0, zone    DMA32  0.000  0.421  0.658  0.835  0.941  0.978  0.990  0.993  0.994  0.994  1.000 
Node 0, zone   Normal  0.000  0.039  0.115  0.231  0.377  0.548  0.707  0.845  0.940  0.979  0.997 
Node 1, zone   Normal  0.000  0.090  0.250  0.442  0.610  0.747  0.853  0.934  0.983  0.997  0.997 

The closer a value is to 1, the more fragmented (and thus unavailable)
memory of that size is. Issuing a compact_memory run doesn't change it
much for the top three sizes.

> Or you need to fix the drivers to not require higher order page
> allocation, but be ok with just some single pages scattered around.
> 
Hacking kernel code is definitely beyond me. ^o^

Also, from what I gleaned of a discussion on the infiniband ML about this
issue, the allocation is part of the standard TCP kernel bits, triggered
most likely by the 64KB MTU.

So I guess I'm still stuck with considering these events "harmless".

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/
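P.S.: For boxes without debugfs mounted, /proc/buddyinfo shows essentially
the same thing in raw form: the number of free blocks per order, per zone.
A quick awk along these lines (untested sketch, adjust to taste) turns the
block counts into free pages tied up in blocks of each order:

  # awk '{ printf "%s %s %s %-8s", $1, $2, $3, $4;
           for (i = 5; i <= NF; i++) printf " %7d", $i * 2^(i-5);
           print "" }' /proc/buddyinfo

If the right-hand (higher-order) columns are close to zero while the
left-hand ones are large, there is plenty of free memory, just not in the
contiguous chunks a higher-order allocation needs, which is exactly what
the unusable_index above is saying.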