Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Apr 16, 2009 at 03:35:10PM -0400, Gennadiy Nerubayev wrote:
> I have a dual primary drbd setup with "sndbuf-size 0" set on both nodes,
> however one node has two gigabytes of memory, and the other has eight.
> Nearly all of the memory is used by the buffer cache due to the target
> caching. If the node that has 2gb is the currently "active" (as in actual
> traffic is only done to one of them), the attached kernel message is
> triggered during heavy I/O operations from the initiator. I have not been
> able to replicate this when the 8gb node is active - could we conclude that
> 2gb is too little in such a setup?

no, I don't think so.
it should definitely be possible to tune this so that it will work
without tripping over page allocation failures deep in the tcp stack.

> Thanks,
>
> -Gennadiy

> drbd0_receiver: page allocation failure. order:4, mode:0x20

this was an "order 4" page allocation.
that means it tries to allocate 16 adjacent pages, 64kB,
while being in "atomic" context (mode:0x20).
and there may still be free (or free-able) pages somewhere,
but too fragmented, or not free-able from the given context.

> Pid: 12266, comm: drbd0_receiver Not tainted 2.6.29.1 #7
> Call Trace:
> [<ffffffff8027a1ef>] __alloc_pages_internal+0x3a4/0x3bd
> [<ffffffff8029a181>] kmem_getpages+0x6b/0x139
> [<ffffffff8029a726>] fallback_alloc+0x13b/0x1b1
> [<ffffffff8029ab50>] __kmalloc+0xc8/0xf1
> [<ffffffff804a54a3>] pskb_expand_head+0x4f/0x14e
> [<ffffffff804a5633>] __pskb_pull_tail+0x57/0x291

from the kernel source code:

    /* Moves tail of skb head forward, copying data from fragmented part,
     * when it is necessary.
     * 1. It may fail due to malloc failure.
     * 2. It may change skb pointers.
     *
     * It is pretty complicated. Luckily, it is called only in exceptional
     * cases.
     */
    unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta)

I'd say the key is to tune the system so these exceptional cases
become exceptionally unlikely.

you probably need to tune the buffer cache to do earlier write-out,
which would make much of the buffer cache free-able on emergency memory
pressure without write-out: non-dirty cache pages can simply be dropped
from the cache.

and you need to tune the tcp (or generic network) stack to reserve more
memory for the socket buffers even under memory pressure, so it won't
need to call out into __pskb_pull_tail.

it may help (or not) to play with certain NIC offloading settings, like
checksumming, sg, or segmentation offloading, as these influence the
network stack allocation behaviour.

try some sysctl:

(tune these down)
vm.dirty_expire_centisecs
vm.dirty_writeback_centisecs
vm.dirty_ratio
vm.dirty_background_ratio

(tune these up)
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
net.core.rmem_max
net.core.wmem_max

there may be more I don't remember right now.
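for illustration only, a possible starting point for those sysctls
(the numbers are guesses, not values verified for this workload;
put them in /etc/sysctl.conf and apply with "sysctl -p",
or try them one by one with "sysctl -w" first):

    # write out dirty pages earlier, so more of the page cache stays clean
    # and can simply be dropped under memory pressure (illustrative values)
    vm.dirty_expire_centisecs = 1000
    vm.dirty_writeback_centisecs = 100
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 3

    # let the tcp stack keep larger socket buffers under pressure
    # (tcp_rmem/tcp_wmem are min/default/max in bytes; max values are guesses)
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216

then watch Dirty and Writeback in /proc/meminfo under load to see whether
the amount of dirty page cache actually stays lower.

for the offloading settings, "ethtool -k eth0" shows what is currently
enabled, and something like "ethtool -K eth0 tso off" toggles it
(eth0 here is just a placeholder for the replication NIC).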
> [<ffffffff804ae5e1>] dev_queue_xmit+0xac/0x447 > [<ffffffff804d010b>] ip_queue_xmit+0x2c5/0x31a > [<ffffffff8020c9ee>] apic_timer_interrupt+0xe/0x20 > [<ffffffff804dfe87>] tcp_transmit_skb+0x5f2/0x62f > [<ffffffff804e13eb>] tcp_write_xmit+0x820/0x8fa > [<ffffffff804df576>] tcp_current_mss+0xa8/0xca > [<ffffffff804e14e7>] __tcp_push_pending_frames+0x22/0x77 > [<ffffffff804def2c>] tcp_rcv_established+0x4e0/0x568 > [<ffffffff804e3db7>] tcp_v4_do_rcv+0x2c/0x1d3 > [<ffffffff8024b466>] autoremove_wake_function+0x0/0x2e > [<ffffffff804d4f5e>] tcp_prequeue_process+0x69/0x7c > [<ffffffff804d7504>] tcp_recvmsg+0x3a2/0x78f > [<ffffffff804a13b5>] sock_common_recvmsg+0x30/0x45 > [<ffffffff8049f6fb>] sock_recvmsg+0xf0/0x10f > [<ffffffff8049f6fb>] sock_recvmsg+0xf0/0x10f > [<ffffffffa02acf59>] __drbd_set_state+0xce/0xdc6 [drbd] > [<ffffffff8024b466>] autoremove_wake_function+0x0/0x2e > [<ffffffffa0065eea>] aac_scsi_cmd+0xb1f/0x109e [aacraid] > [<ffffffff8022ce6f>] target_load+0x24/0x4f > [<ffffffff8022dab6>] enqueue_task+0x48/0x51 > [<ffffffff8022dad9>] activate_task+0x1a/0x20 > [<ffffffff80231a33>] try_to_wake_up+0x262/0x274 > [<ffffffffa029c4ab>] drbd_recv+0x71/0x110 [drbd] > [<ffffffffa029c7e6>] drbd_recv_header+0x16/0xa2 [drbd] > [<ffffffffa029cf0e>] drbdd+0x28/0x154 [drbd] > [<ffffffffa029fc63>] drbdd_init+0xbe/0x104 [drbd] > [<ffffffffa02b09b4>] drbd_thread_setup+0x115/0x193 [drbd] > [<ffffffff8020ceba>] child_rip+0xa/0x20 > [<ffffffffa02b089f>] drbd_thread_setup+0x0/0x193 [drbd] > [<ffffffff8020ceb0>] child_rip+0x0/0x20 > Mem-Info: > Node 0 DMA per-cpu: > CPU 0: hi: 0, btch: 1 usd: 0 > CPU 1: hi: 0, btch: 1 usd: 0 > Node 0 DMA32 per-cpu: > CPU 0: hi: 186, btch: 31 usd: 185 > CPU 1: hi: 186, btch: 31 usd: 141 > Active_anon:9257 active_file:4093 inactive_anon:8664 > inactive_file:389059 unevictable:1348 dirty:20548 writeback:2039 unstable:0 > free:3560 slab:34427 mapped:2156 pagetables:1059 bounce:0 > Node 0 DMA free:8004kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:196kB unevictable:0kB present:7304kB pages_scanned:0 all_unreclaimable? no > lowmem_reserve[]: 0 2003 2003 2003 > Node 0 DMA32 free:6824kB min:5712kB low:7140kB high:8568kB active_anon:37028kB inactive_anon:34656kB active_file:16368kB inactive_file:1555644kB unevictable:5392kB present:2051244kB pages_scanned:0 all_unreclaimable? no > lowmem_reserve[]: 0 0 0 0 > Node 0 DMA: 4*4kB 3*8kB 4*16kB 3*32kB 4*64kB 5*128kB 1*256kB 1*512kB 2*1024kB 2*2048kB 0*4096kB = 8008kB > Node 0 DMA32: 1062*4kB 230*8kB 6*16kB 1*32kB 2*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 6856kB > 395857 total pagecache pages > 1732 pages in swap cache > Swap cache stats: add 5558, delete 3826, find 39369/39711 > Free swap = 3999300kB > Total swap = 4008176kB > 524000 pages RAM > 42048 pages reserved > 401850 pages shared > 77174 pages non-shared -- : Lars Ellenberg : LINBIT HA-Solutions GmbH : DRBD®/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. __ please don't Cc me, but send to list -- I'm subscribed