Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Apr 16, 2009 at 03:35:10PM -0400, Gennadiy Nerubayev wrote:
> I have a dual primary drbd setup with "sndbuf-size 0" set on both nodes,
> however one node has two gigabytes of memory, and the other has eight.
> Nearly all of the memory is used by the buffer cache due to the target
> caching. If the node that has 2gb is the currently "active" (as in actual
> traffic is only done to one of them), the attached kernel message is
> triggered during heavy I/O operations from the initiator. I have not been
> able to replicate this when the 8gb node is active - could we conclude that
> 2gb is too little in such a setup?

no, I don't think so.
it should definitely be possible to tune this so that it will work
without tripping over page allocation failures deep in the tcp stack.

> Thanks,
>
> -Gennadiy

> drbd0_receiver: page allocation failure. order:4, mode:0x20

this was an "order 4" page allocation.
that means it tries to allocate 16 adjacent pages, 64kB,
while being in "atomic" context (mode:0x20).
and there may still be free (or free-able) pages somewhere,
but too fragmented, or not free-able from the given context.

> Pid: 12266, comm: drbd0_receiver Not tainted 2.6.29.1 #7
> Call Trace:
> [<ffffffff8027a1ef>] __alloc_pages_internal+0x3a4/0x3bd
> [<ffffffff8029a181>] kmem_getpages+0x6b/0x139
> [<ffffffff8029a726>] fallback_alloc+0x13b/0x1b1
> [<ffffffff8029ab50>] __kmalloc+0xc8/0xf1
> [<ffffffff804a54a3>] pskb_expand_head+0x4f/0x14e
> [<ffffffff804a5633>] __pskb_pull_tail+0x57/0x291

from the kernel source code:

    /* Moves tail of skb head forward, copying data from fragmented part,
     * when it is necessary.
     * 1. It may fail due to malloc failure.
     * 2. It may change skb pointers.
     *
     * It is pretty complicated. Luckily, it is called only in exceptional
     * cases.
     */
    unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta)

I'd say the key is to tune the system so these exceptional cases
become exceptionally unlikely.

you probably need to tune the buffer cache to do earlier write-out,
which would make much of the buffer cache free-able on emergency memory
pressure without write-out: non-dirty cache pages can simply be dropped
from the cache.

and you need to tune the tcp (or generic network) stack to reserve more
memory for the socket buffers even under memory pressure, so it won't
need to call out into __pskb_pull_tail.

it may help (or not) to play with certain NIC offloading settings, like
checksumming, sg, or segmentation offloading, as these influence the
network stack allocation behaviour.

try some sysctl:

(tune these down)
vm.dirty_expire_centisecs
vm.dirty_writeback_centisecs
vm.dirty_ratio
vm.dirty_background_ratio

(tune these up)
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
net.core.rmem_max
net.core.wmem_max

there may be more I don't remember right now.
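for illustration only, a possible starting point for those sysctls
(the numbers are guesses, not values verified for this workload;
put them in /etc/sysctl.conf and apply with "sysctl -p",
or try them one by one with "sysctl -w" first):

    # write out dirty pages earlier, so more of the page cache stays clean
    # and can simply be dropped under memory pressure (illustrative values)
    vm.dirty_expire_centisecs = 1000
    vm.dirty_writeback_centisecs = 100
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 3

    # let the tcp stack keep larger socket buffers under pressure
    # (tcp_rmem/tcp_wmem are min/default/max in bytes; max values are guesses)
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216

then watch Dirty and Writeback in /proc/meminfo under load to see whether
the amount of dirty page cache actually stays lower.

for the offloading settings, "ethtool -k eth0" shows what is currently
enabled, and something like "ethtool -K eth0 tso off" toggles it
(eth0 here is just a placeholder for the replication NIC).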
> [<ffffffff804ae5e1>] dev_queue_xmit+0xac/0x447 > [<ffffffff804d010b>] ip_queue_xmit+0x2c5/0x31a > [<ffffffff8020c9ee>] apic_timer_interrupt+0xe/0x20 > [<ffffffff804dfe87>] tcp_transmit_skb+0x5f2/0x62f > [<ffffffff804e13eb>] tcp_write_xmit+0x820/0x8fa > [<ffffffff804df576>] tcp_current_mss+0xa8/0xca > [<ffffffff804e14e7>] __tcp_push_pending_frames+0x22/0x77 > [<ffffffff804def2c>] tcp_rcv_established+0x4e0/0x568 > [<ffffffff804e3db7>] tcp_v4_do_rcv+0x2c/0x1d3 > [<ffffffff8024b466>] autoremove_wake_function+0x0/0x2e > [<ffffffff804d4f5e>] tcp_prequeue_process+0x69/0x7c > [<ffffffff804d7504>] tcp_recvmsg+0x3a2/0x78f > [<ffffffff804a13b5>] sock_common_recvmsg+0x30/0x45 > [<ffffffff8049f6fb>] sock_recvmsg+0xf0/0x10f > [<ffffffff8049f6fb>] sock_recvmsg+0xf0/0x10f > [<ffffffffa02acf59>] __drbd_set_state+0xce/0xdc6 [drbd] > [<ffffffff8024b466>] autoremove_wake_function+0x0/0x2e > [<ffffffffa0065eea>] aac_scsi_cmd+0xb1f/0x109e [aacraid] > [<ffffffff8022ce6f>] target_load+0x24/0x4f > [<ffffffff8022dab6>] enqueue_task+0x48/0x51 > [<ffffffff8022dad9>] activate_task+0x1a/0x20 > [<ffffffff80231a33>] try_to_wake_up+0x262/0x274 > [<ffffffffa029c4ab>] drbd_recv+0x71/0x110 [drbd] > [<ffffffffa029c7e6>] drbd_recv_header+0x16/0xa2 [drbd] > [<ffffffffa029cf0e>] drbdd+0x28/0x154 [drbd] > [<ffffffffa029fc63>] drbdd_init+0xbe/0x104 [drbd] > [<ffffffffa02b09b4>] drbd_thread_setup+0x115/0x193 [drbd] > [<ffffffff8020ceba>] child_rip+0xa/0x20 > [<ffffffffa02b089f>] drbd_thread_setup+0x0/0x193 [drbd] > [<ffffffff8020ceb0>] child_rip+0x0/0x20 > Mem-Info: > Node 0 DMA per-cpu: > CPU 0: hi: 0, btch: 1 usd: 0 > CPU 1: hi: 0, btch: 1 usd: 0 > Node 0 DMA32 per-cpu: > CPU 0: hi: 186, btch: 31 usd: 185 > CPU 1: hi: 186, btch: 31 usd: 141 > Active_anon:9257 active_file:4093 inactive_anon:8664 > inactive_file:389059 unevictable:1348 dirty:20548 writeback:2039 unstable:0 > free:3560 slab:34427 mapped:2156 pagetables:1059 bounce:0 > Node 0 DMA free:8004kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:196kB unevictable:0kB present:7304kB pages_scanned:0 all_unreclaimable? no > lowmem_reserve[]: 0 2003 2003 2003 > Node 0 DMA32 free:6824kB min:5712kB low:7140kB high:8568kB active_anon:37028kB inactive_anon:34656kB active_file:16368kB inactive_file:1555644kB unevictable:5392kB present:2051244kB pages_scanned:0 all_unreclaimable? no > lowmem_reserve[]: 0 0 0 0 > Node 0 DMA: 4*4kB 3*8kB 4*16kB 3*32kB 4*64kB 5*128kB 1*256kB 1*512kB 2*1024kB 2*2048kB 0*4096kB = 8008kB > Node 0 DMA32: 1062*4kB 230*8kB 6*16kB 1*32kB 2*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 6856kB > 395857 total pagecache pages > 1732 pages in swap cache > Swap cache stats: add 5558, delete 3826, find 39369/39711 > Free swap = 3999300kB > Total swap = 4008176kB > 524000 pages RAM > 42048 pages reserved > 401850 pages shared > 77174 pages non-shared -- : Lars Ellenberg : LINBIT HA-Solutions GmbH : DRBD®/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. __ please don't Cc me, but send to list -- I'm subscribed