Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Sep 22, 2009, at 4:34 PM, Lars Ellenberg wrote:

> On Tue, Sep 22, 2009 at 01:31:26PM -0400, Jason McKay wrote:
>> Hello all,
>>
>> We're experiencing page allocation errors when writing to a connected
>> drbd device (connected via infiniband using IPoIB) that resulted in a
>> kernel panic last night.
>>
>> When rsyncing data to the primary node, we get a slew of these
>> errors in /var/log/messages:
>>
>> Sep 21 22:11:21 client-nfs5 kernel: drbd0_worker: page allocation failure. order:5, mode:0x10
>> Sep 21 22:11:21 client-nfs5 kernel: Pid: 6114, comm: drbd0_worker Tainted: P 2.6.27.25 #2
>> Sep 21 22:11:21 client-nfs5 kernel:
>> Sep 21 22:11:21 client-nfs5 kernel: Call Trace:
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8026eec4>] __alloc_pages_internal+0x3a4/0x3c0
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c61a>] kmem_getpages+0x6b/0x12b
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028cb93>] fallback_alloc+0x11d/0x1b1
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c816>] kmem_cache_alloc_node+0xa3/0xcf
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803ecf65>] __alloc_skb+0x64/0x12e
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041af6f>] sk_stream_alloc_skb+0x2f/0xd5
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041beb0>] tcp_sendmsg+0x180/0x9d1
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803e6d85>] sock_sendmsg+0xe2/0xff
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>] autoremove_wake_function+0x0/0x2e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>] autoremove_wake_function+0x0/0x2e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8027464b>] zone_statistics+0x3a/0x5d
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff803e82f6>] kernel_sendmsg+0x2c/0x3e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0568d83>] drbd_send+0xb2/0x194 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569134>] _drbd_send_cmd+0x9c/0x116 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569285>] send_bitmap_rle_or_plain+0xd7/0x13a [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569475>] _drbd_send_bitmap+0x18d/0x1ae [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056c242>] drbd_send_bitmap+0x39/0x4c [drbd]
>
> I certainly would not expect to _cause_
> memory pressure from this call path.
> But someone is causing it, and we are affected.
>
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b9a1>] w_bitmap_io+0x45/0x95 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa05560c4>] drbd_worker+0x230/0x3eb [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b0d9>] drbd_thread_setup+0x124/0x1ba [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd59>] child_rip+0xa/0x11
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056afb5>] drbd_thread_setup+0x0/0x1ba [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd4f>] child_rip+0x0/0x11
>
> why is the tcp stack trying to allocate order:5 pages?
>
> that's 32 contiguous pages, 128 KiB of contiguous memory.  apparently
> your memory is so fragmented that this is no longer available.
> but why is the tcp stack not ok with order:0 or order:1 pages?
> why does it have to be order:5?
>
>> These were occurring for hours until a kernel panic:
>>
>> [Mon Sep 21 22:11:33 2009]INFO: task kjournald:7025 blocked for more than 120 seconds.
>> [Mon Sep 21 22:11:33 2009]"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Mon Sep 21 22:11:33 2009] ffff88062e801d30 0000000000000046 ffff88062e801cf8 ffffffffa056b821
>> [Mon Sep 21 22:11:33 2009] ffff880637d617c0 ffff88063dd05120 ffff88063e668750 ffff88063dd05478
>> [Mon Sep 21 22:11:33 2009] 0000000900000002 000000011a942a2d ffffffffffffffff ffffffffffffffff
>> [Mon Sep 21 22:11:33 2009]Call Trace:
>> [Mon Sep 21 22:11:33 2009] [<ffffffffa056b821>] drbd_unplug_fn+0x14a/0x1aa [drbd]
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff804903df>] io_schedule+0x5d/0x9f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b2028>] sync_buffer+0x3b/0x3f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff80490658>] __wait_on_bit+0x40/0x6f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
>> [Mon Sep 21 22:11:34 2009] [<ffffffff804906f3>] out_of_line_wait_on_bit+0x6c/0x78
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d34>] wake_bit_function+0x0/0x23
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0031250>] journal_commit_transaction+0x7cd/0xc4d [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8023dc9b>] lock_timer_base+0x26/0x4b
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033d32>] kjournald+0xc1/0x1fb [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d06>] autoremove_wake_function+0x0/0x2e
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033c71>] kjournald+0x0/0x1fb [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246bd8>] kthread+0x47/0x73
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd59>] child_rip+0xa/0x11
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246b91>] kthread+0x0/0x73
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd4f>] child_rip+0x0/0x11
>> [Mon Sep 21 22:11:34 2009]
>> [Mon Sep 21 22:11:34 2009]Kernel panic - not syncing: softlockup: blocked tasks
>> [-- root at localhost.localdomain attached -- Mon Sep 21 22:17:24 2009]
>> [-- root at localhost.localdomain detached -- Mon Sep 21 23:16:22 2009]
>> [-- Console down -- Tue Sep 22 04:02:15 2009]
>
> That is unfortunate.
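
On the order:5 question above: as a quick sanity check on our side (this is
just our own idea for watching it, not anything from the DRBD docs), we plan
to keep an eye on memory fragmentation on the primary while we rerun the
rsync, e.g.:

# /proc/buddyinfo lists free block counts per zone for orders 0..10;
# order 5 (the sixth count column) is the 128 KiB block size from the trace
watch -n 5 cat /proc/buddyinfo

If the order >= 5 columns collapse toward zero under load, that would at
least confirm the fragmentation theory.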
>
>> The two systems are hardware and OS identical:
>>
>> [root at client-nfs5 log]# uname -a
>> Linux client-nfs5 2.6.27.25 #2 SMP Fri Jun 26 00:07:23 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>>
>> [root at client-nfs5 log]# grep model\ name /proc/cpuinfo
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>>
>> [root at client-nfs5 log]# free -m
>>                   total       used       free     shared    buffers     cached
>> Mem:              24110       2050      22060          0        157       1270
>> -/+ buffers/cache:             622      23488
>> Swap:              4094          0       4094
>>
>> [root at client-nfs5 log]# cat /proc/partitions
>> major minor  #blocks  name
>>
>>    8     0 1755310080 sda
>>    8     1 1755302031 sda1
>>    8    16 4881116160 sdb
>>    8    17 4881116126 sdb1
>>    8    32   62522712 sdc
>>    8    33   58147236 sdc1
>>    8    34    4192965 sdc2
>>  147     0 1755248424 drbd0
>>  147     1 4880967128 drbd1
>>
>> [root at client-nfs5 log]# modinfo drbd
>> filename:       /lib/modules/2.6.27.25/kernel/drivers/block/drbd.ko
>> alias:          block-major-147-*
>> license:        GPL
>> version:        8.3.3rc1
>
> Just in case, please retry with current git.
> But it is more likely an issue with badly tuned tcp, or tcp via IPoIB.
>
>> description:    drbd - Distributed Replicated Block Device v8.3.3rc1
>> author:         Philipp Reisner <phil at linbit.com>, Lars Ellenberg <lars at linbit.com>
>> srcversion:     A06488A2009E5EB94AF0825
>> depends:
>> vermagic:       2.6.27.25 SMP mod_unload modversions
>> parm:           minor_count:Maximum number of drbd devices (1-255) (uint)
>> parm:           disable_sendpage:bool
>> parm:           allow_oos:DONT USE! (bool)
>> parm:           cn_idx:uint
>> parm:           proc_details:int
>> parm:           enable_faults:int
>> parm:           fault_rate:int
>> parm:           fault_count:int
>> parm:           fault_devs:int
>> parm:           usermode_helper:string
>>
>> [root at client-nfs5 log]# cat /etc/drbd.conf
>> global {
>>     usage-count no;
>> }
>>
>> common {
>>     syncer { rate 800M; }   #This is temporary
>
> please reduce, and only slowly increase again, until the cause of the
> high memory pressure is identified.
> maybe someone is leaking full pages?
> with infiniband SDP, that in fact did happen.
>
> though, if there was no current resync going on,
> this cannot be the cause.
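
Understood; we'll pull the syncer rate back while we chase this down and only
step it up again afterwards, e.g. something like

    syncer { rate 100M; }   # reduced from 800M until the memory pressure is understood

in the common section above (the 100M figure is just our own conservative
placeholder, not a recommendation from the list).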
>
>>     protocol C;
>>
>>     handlers {
>>     }
>>
>>     startup {
>>         degr-wfc-timeout 60;
>>         wfc-timeout 60;
>>     }
>>
>>     disk {
>>         on-io-error detach;
>>         fencing dont-care;
>>         no-disk-flushes;
>>         no-md-flushes;
>>         no-disk-barrier;
>>     }
>>
>>     net {
>>         ko-count 2;
>>         after-sb-2pri disconnect;
>>         max-buffers 16000;
>>         max-epoch-size 16000;
>>         unplug-watermark 16001;
>>         sndbuf-size 8m;
>>         rcvbuf-size 8m;
>>     }
>> }
>>
>> resource drbd0 {
>>     on client-nfs5 {
>>         device /dev/drbd0;
>>         disk /dev/sda1;
>>         address 192.168.16.48:7789;
>>         flexible-meta-disk internal;
>>     }
>>
>>     on client-nfs6 {
>>         device /dev/drbd0;
>>         disk /dev/sda1;
>>         address 192.168.16.49:7789;
>>         flexible-meta-disk internal;
>>     }
>> }
>>
>> resource drbd1 {
>>     on client-nfs5 {
>>         device /dev/drbd1;
>>         disk /dev/sdb1;
>>         address 192.168.16.48:7790;
>>         flexible-meta-disk internal;
>>     }
>>
>>     on client-nfs6 {
>>         device /dev/drbd1;
>>         disk /dev/sdb1;
>>         address 192.168.16.49:7790;
>>         flexible-meta-disk internal;
>>     }
>> }
>>
>> [root at client-nfs5 log]# ibstat
>> CA 'mlx4_0'
>>     CA type: MT26428
>>     Number of ports: 2
>>     Firmware version: 2.6.100
>>     Hardware version: a0
>>     Node GUID: 0x0002c9030004a79a
>>     System image GUID: 0x0002c9030004a79d
>>     Port 1:
>>         State: Active
>>         Physical state: LinkUp
>>         Rate: 40
>>         Base lid: 1
>>         LMC: 0
>>         SM lid: 1
>>         Capability mask: 0x0251086a
>>         Port GUID: 0x0002c9030004a79b
>>     Port 2:
>>         State: Active
>>         Physical state: LinkUp
>>         Rate: 40
>>         Base lid: 1
>>         LMC: 0
>>         SM lid: 1
>>         Capability mask: 0x0251086a
>>         Port GUID: 0x0002c9030004a79c
>>
>> [root at client-nfs5 log]# cat /etc/sysctl.conf
>> # Kernel sysctl configuration file for Red Hat Linux
>> #
>> # For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
>> # sysctl.conf(5) for more details.
>>
>> # Controls IP packet forwarding
>> net.ipv4.ip_forward = 0
>>
>> # Controls source route verification
>> net.ipv4.conf.default.rp_filter = 1
>>
>> # Do not accept source routing
>> net.ipv4.conf.default.accept_source_route = 0
>>
>> # Controls the System Request debugging functionality of the kernel
>> kernel.sysrq = 0
>>
>> # Controls whether core dumps will append the PID to the core filename
>> # Useful for debugging multi-threaded applications
>> kernel.core_uses_pid = 1
>>
>> # Controls the use of TCP syncookies
>> net.ipv4.tcp_syncookies = 1
>>
>> # Controls the maximum size of a message, in bytes
>> kernel.msgmnb = 65536
>>
>> # Controls the default maximum size of a message queue
>> kernel.msgmax = 65536
>>
>> # Controls the maximum shared segment size, in bytes
>> kernel.shmmax = 68719476736
>>
>> # Controls the maximum number of shared memory segments, in pages
>> kernel.shmall = 4294967296
>> net.ipv4.tcp_timestamps=0
>> net.ipv4.tcp_sack=0
>> net.core.netdev_max_backlog=250000
>> net.core.rmem_max=16777216
>> net.core.wmem_max=16777216
>> net.core.rmem_default=16777216
>> net.core.wmem_default=16777216
>> net.core.optmem_max=16777216
>
> this:
>
>> net.ipv4.tcp_mem=16777216 16777216 16777216
>
> ^^ is nonsense.
> tcp_mem is in pages, and has _nothing_ to do with the 16 MiB max buffer
> size you apparently want to set.
>
> I know the first hits on google for tcp_mem sysctl suggest otherwise.
> They are misguided and wrong.
>
> tcp_mem controls the watermarks for _global_ tcp stack memory usage:
> the thresholds at which no pressure, little pressure, or high pressure
> is applied to the tcp stack when it tries to allocate more.
>
> What you say there is that you do not bother to throttle the tcp stack
> memory usage at all, until its total usage would reach about
> 16777216 * 4k, which is 64 GiB, which is more than double the amount of
> total RAM in your box.
>
> DO NOT DO THAT.
>
> Or your box will run into hard OOM conditions,
> invoke the oom-killer, and will eventually panic.
>
> Set these values to e.g. 10%, 15%, 30% of total RAM,
> and remember the unit is _pages_ (4k).
>
>> net.ipv4.tcp_rmem=4096 87380 16777216
>> net.ipv4.tcp_wmem=4096 65536 16777216
>
> here, however, I suggest upping the lower mark to about 32k or 64k,
> which is the "guaranteed" amount of buffer space a tcp buffer is
> promised to be left with, without further throttling, even under
> overall high memory pressure.

Ok, this makes sense.  These settings were taken from the OFED 1.4.2
release notes and are actually set dynamically when openibd is started
(via /sbin/ib_ipoib_sysctl)...  We'll start on these, then; rough
numbers for our boxes are in the P.S. below.

>
>> Since we noticed these errors, we've been searching for them on our
>> logservers and found another case where this is occurring using the
>> same drbd version and infiniband.  This other case, however, uses a
>> stock Red Hat kernel.
>>
>> Has anyone else seen these errors?  Do the tuning recommendations for
>> send/receive buffers need to be revisited?  Would switching to rdp vs
>> IPoIB be a potential fix here?  Thanks in advance for any insights.
>
> I'm not sure about RDP in this context, I know it as
> http://en.wikipedia.org/wiki/Reliable_Datagram_Protocol
> which is not available for DRBD.

The context here is my bad typing.  I meant SDP.

> SDP (as in http://en.wikipedia.org/wiki/Sockets_Direct_Protocol)
> is available with DRBD 8.3.3 (rc*).
> But make sure you use the SDP from OFED 1.4.2, or SDP itself will leak
> full pages.

In our testing, we've been using the SDP from OFED 1.4.2.

> As SDP support is a new feature in DRBD, I'd like to have feedback on
> how well it works for you, and how performance as well as CPU usage
> compares to IPoIB.
>
> But correcting the tcp_mem setting above
> is more likely to fix your symptoms.

I suspect it will.  We'll test and follow up.  Much thanks for the
quick reply.

> --
> : Lars Ellenberg
> : LINBIT HA-Solutions GmbH
> : DRBD®/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user

--
Jason McKay
Sr. Engineer
Logicworks
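
P.S.  For the archives, the back-of-the-envelope numbers we plan to try,
following the 10% / 15% / 30%-of-RAM guideline above: these boxes have
roughly 24 GiB of RAM, i.e. about 6.3 million 4 KiB pages, so something
like the following (our own arithmetic, not values blessed by LINBIT or
the OFED notes):

# tcp_mem is counted in 4 KiB pages: ~10% / ~15% / ~30% of ~6291456 pages
net.ipv4.tcp_mem = 629145 943718 1887436
# and raise the guaranteed per-socket minimum to 64 KiB as suggested
net.ipv4.tcp_rmem = 65536 87380 16777216
net.ipv4.tcp_wmem = 65536 65536 16777216

We'll report back whether this makes the order:5 allocation failures
disappear.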