Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Sep 22, 2009, at 4:34 PM, Lars Ellenberg wrote:
> On Tue, Sep 22, 2009 at 01:31:26PM -0400, Jason McKay wrote:
>> Hello all,
>>
>> We're experiencing page allocation errors when writing to a connected
>> drbd device (connected via infiniband using IPoIB) that resulted in a
>> kernel panic last night.
>>
>> When rsyncing data to the primary node, we get a slew of these
>> errors in /var/log/messages:
>>
>> Sep 21 22:11:21 client-nfs5 kernel: drbd0_worker: page allocation
>> failure. order:5, mode:0x10
>> Sep 21 22:11:21 client-nfs5 kernel: Pid: 6114, comm: drbd0_worker
>> Tainted: P 2.6.27.25 #2
>> Sep 21 22:11:21 client-nfs5 kernel:
>> Sep 21 22:11:21 client-nfs5 kernel: Call Trace:
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8026eec4>]
>> __alloc_pages_internal+0x3a4/0x3c0
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c61a>]
>> kmem_getpages+0x6b/0x12b
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028cb93>]
>> fallback_alloc+0x11d/0x1b1
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c816>]
>> kmem_cache_alloc_node+0xa3/0xcf
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803ecf65>]
>> __alloc_skb+0x64/0x12e
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041af6f>]
>> sk_stream_alloc_skb+0x2f/0xd5
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041beb0>]
>> tcp_sendmsg+0x180/0x9d1
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803e6d85>]
>> sock_sendmsg+0xe2/0xff
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>]
>> autoremove_wake_function+0x0/0x2e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>]
>> autoremove_wake_function+0x0/0x2e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8027464b>]
>> zone_statistics+0x3a/0x5d
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff803e82f6>]
>> kernel_sendmsg+0x2c/0x3e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0568d83>] drbd_send
>> +0xb2/0x194 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569134>]
>> _drbd_send_cmd+0x9c/0x116 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569285>]
>> send_bitmap_rle_or_plain+0xd7/0x13a [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569475>]
>> _drbd_send_bitmap+0x18d/0x1ae [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056c242>]
>> drbd_send_bitmap+0x39/0x4c [drbd]
>
> I certainly would not expect to _cause_
> memory pressure from this call path.
> But someone is causing it, and we are affected.
>
>
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b9a1>]
>> w_bitmap_io+0x45/0x95 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa05560c4>]
>> drbd_worker+0x230/0x3eb [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b0d9>]
>> drbd_thread_setup+0x124/0x1ba [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd59>] child_rip
>> +0xa/0x11
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056afb5>]
>> drbd_thread_setup+0x0/0x1ba [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd4f>] child_rip
>> +0x0/0x11
>
> Why is the TCP stack trying to allocate order:5 pages?
>
> That's 32 contiguous pages, 128 KiB of contiguous memory. Apparently
> your memory is so fragmented that this is no longer available.
> But why is the TCP stack not OK with order:0 or order:1 pages?
> Why does it have to be order:5?
>
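For what it's worth, we can watch how fragmented the box gets via
/proc/buddyinfo, which lists the number of free blocks of each order
per zone (columns are order 0 through 10), along the lines of:

    cat /proc/buddyinfo
    Node 0, zone   Normal   1046    527    128     36      0      0 ...

(illustrative output, not captured from this box). If the order-5
column and everything to its right sits at or near zero, 128 KiB
contiguous allocations will fail exactly like the ones above.
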
>> These were occurring for hours until a kernel panic:
>>
>> [Mon Sep 21 22:11:33 2009]INFO: task kjournald:7025 blocked for
>> more than 120 seconds.
>> [Mon Sep 21 22:11:33 2009]"echo 0 > /proc/sys/kernel/
>> hung_task_timeout_secs" disables this message.
>> [Mon Sep 21 22:11:33 2009] ffff88062e801d30 0000000000000046
>> ffff88062e801cf8 ffffffffa056b821
>> [Mon Sep 21 22:11:33 2009] ffff880637d617c0 ffff88063dd05120
>> ffff88063e668750 ffff88063dd05478
>> [Mon Sep 21 22:11:33 2009] 0000000900000002 000000011a942a2d
>> ffffffffffffffff ffffffffffffffff
>> [Mon Sep 21 22:11:33 2009]Call Trace:
>> [Mon Sep 21 22:11:33 2009] [<ffffffffa056b821>] drbd_unplug_fn
>> +0x14a/0x1aa [drbd]
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff804903df>] io_schedule+0x5d/0x9f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b2028>] sync_buffer+0x3b/0x3f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff80490658>] __wait_on_bit
>> +0x40/0x6f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
>> [Mon Sep 21 22:11:34 2009] [<ffffffff804906f3>]
>> out_of_line_wait_on_bit+0x6c/0x78
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d34>] wake_bit_function
>> +0x0/0x23
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0031250>]
>> journal_commit_transaction+0x7cd/0xc4d [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8023dc9b>] lock_timer_base
>> +0x26/0x4b
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033d32>] kjournald
>> +0xc1/0x1fb [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d06>]
>> autoremove_wake_function+0x0/0x2e
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033c71>] kjournald+0x0/0x1fb
>> [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246bd8>] kthread+0x47/0x73
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd59>] child_rip+0xa/0x11
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246b91>] kthread+0x0/0x73
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd4f>] child_rip+0x0/0x11
>> [Mon Sep 21 22:11:34 2009]
>> [Mon Sep 21 22:11:34 2009]Kernel panic - not syncing: softlockup:
>> blocked tasks
>> [-- root at localhost.localdomain attached -- Mon Sep 21 22:17:24 2009]
>> [-- root at localhost.localdomain detached -- Mon Sep 21 23:16:22 2009]
>> [-- Console down -- Tue Sep 22 04:02:15 2009]
>
> That is unfortunate.
>
>> The two systems are hardware and OS identical:
>>
>> [root at client-nfs5 log]# uname -a
>> Linux client-nfs5 2.6.27.25 #2 SMP Fri Jun 26 00:07:23 EDT 2009
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> [root at client-nfs5 log]# grep model\ name /proc/cpuinfo
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>>
>> [root at client-nfs5 log]# free -m
>>              total       used       free     shared    buffers     cached
>> Mem:         24110       2050      22060          0        157       1270
>> -/+ buffers/cache:        622      23488
>> Swap:         4094          0       4094
>>
>> [root at client-nfs5 log]# cat /proc/partitions
>> major minor #blocks name
>>
>> 8 0 1755310080 sda
>> 8 1 1755302031 sda1
>> 8 16 4881116160 sdb
>> 8 17 4881116126 sdb1
>> 8 32 62522712 sdc
>> 8 33 58147236 sdc1
>> 8 34 4192965 sdc2
>> 147 0 1755248424 drbd0
>> 147 1 4880967128 drbd1
>>
>> [root at client-nfs5 log]# modinfo drbd
>> filename: /lib/modules/2.6.27.25/kernel/drivers/block/drbd.ko
>> alias: block-major-147-*
>> license: GPL
>> version: 8.3.3rc1
>
> Just in case, please retry with current git.
> But it is more likely an issue with badly tuned tcp, or tcp via IPoIB.
>
>> description: drbd - Distributed Replicated Block Device v8.3.3rc1
>> author: Philipp Reisner <phil at linbit.com>, Lars Ellenberg <lars at linbit.com>
>> srcversion: A06488A2009E5EB94AF0825
>> depends:
>> vermagic: 2.6.27.25 SMP mod_unload modversions
>> parm: minor_count:Maximum number of drbd devices (1-255)
>> (uint)
>> parm: disable_sendpage:bool
>> parm: allow_oos:DONT USE! (bool)
>> parm: cn_idx:uint
>> parm: proc_details:int
>> parm: enable_faults:int
>> parm: fault_rate:int
>> parm: fault_count:int
>> parm: fault_devs:int
>> parm: usermode_helper:string
>>
>> [root at client-nfs5 log]# cat /etc/drbd.conf
>> global {
>> usage-count no;
>> }
>>
>> common {
>> syncer { rate 800M; } #This is temporary
>
> Please reduce it, and only slowly increase it again, until the cause of
> the high memory pressure is identified.
> Maybe someone is leaking full pages?
> With InfiniBand SDP, that did in fact happen.
>
> Though if there was no resync going on at the time,
> this cannot be the cause.
>
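Understood. We can drop it at runtime rather than editing the config,
something along the lines of (device names here are ours):

    drbdsetup /dev/drbd0 syncer -r 100M
    drbdsetup /dev/drbd1 syncer -r 100M

and revert to whatever drbd.conf says later with "drbdadm adjust drbd0"
(and drbd1), stepping the rate back up while we watch for the
allocation failures.
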
>> protocol C;
>>
>> handlers {
>> }
>>
>> startup { degr-wfc-timeout 60;
>> wfc-timeout 60;
>> }
>>
>> disk {
>> on-io-error detach;
>> fencing dont-care;
>> no-disk-flushes;
>> no-md-flushes;
>> no-disk-barrier;
>> }
>>
>> net {
>> ko-count 2;
>> after-sb-2pri disconnect;
>> max-buffers 16000;
>> max-epoch-size 16000;
>> unplug-watermark 16001;
>> sndbuf-size 8m;
>> rcvbuf-size 8m;
>>
>> }
>> }
>>
>> resource drbd0 {
>> on client-nfs5 {
>> device /dev/drbd0;
>> disk /dev/sda1;
>> address 192.168.16.48:7789;
>> flexible-meta-disk internal;
>> }
>>
>> on client-nfs6 {
>> device /dev/drbd0;
>> disk /dev/sda1;
>> address 192.168.16.49:7789;
>> flexible-meta-disk internal;
>> }
>> }
>>
>> resource drbd1 {
>> on client-nfs5 {
>> device /dev/drbd1;
>> disk /dev/sdb1;
>> address 192.168.16.48:7790;
>> flexible-meta-disk internal;
>> }
>>
>> on client-nfs6 {
>> device /dev/drbd1;
>> disk /dev/sdb1;
>> address 192.168.16.49:7790;
>> flexible-meta-disk internal;
>> }
>> }
>>
>> [root at client-nfs5 log]# ibstat
>> CA 'mlx4_0'
>> CA type: MT26428
>> Number of ports: 2
>> Firmware version: 2.6.100
>> Hardware version: a0
>> Node GUID: 0x0002c9030004a79a
>> System image GUID: 0x0002c9030004a79d
>> Port 1:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
>> Base lid: 1
>> LMC: 0
>> SM lid: 1
>> Capability mask: 0x0251086a
>> Port GUID: 0x0002c9030004a79b
>> Port 2:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
>> Base lid: 1
>> LMC: 0
>> SM lid: 1
>> Capability mask: 0x0251086a
>> Port GUID: 0x0002c9030004a79c
>>
>> [root at client-nfs5 log]# cat /etc/sysctl.conf
>> # Kernel sysctl configuration file for Red Hat Linux
>> #
>> # For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
>> # sysctl.conf(5) for more details.
>>
>> # Controls IP packet forwarding
>> net.ipv4.ip_forward = 0
>>
>> # Controls source route verification
>> net.ipv4.conf.default.rp_filter = 1
>>
>> # Do not accept source routing
>> net.ipv4.conf.default.accept_source_route = 0
>>
>> # Controls the System Request debugging functionality of the kernel
>> kernel.sysrq = 0
>>
>> # Controls whether core dumps will append the PID to the core
>> filename
>> # Useful for debugging multi-threaded applications
>> kernel.core_uses_pid = 1
>>
>> # Controls the use of TCP syncookies
>> net.ipv4.tcp_syncookies = 1
>>
>> # Controls the maximum size of a message, in bytes
>> kernel.msgmnb = 65536
>>
>> # Controls the default maximum size of a message queue
>> kernel.msgmax = 65536
>>
>> # Controls the maximum shared segment size, in bytes
>> kernel.shmmax = 68719476736
>>
>> # Controls the maximum number of shared memory segments, in pages
>> kernel.shmall = 4294967296
>> net.ipv4.tcp_timestamps=0
>> net.ipv4.tcp_sack=0
>> net.core.netdev_max_backlog=250000
>> net.core.rmem_max=16777216
>> net.core.wmem_max=16777216
>> net.core.rmem_default=16777216
>> net.core.wmem_default=16777216
>> net.core.optmem_max=16777216
>
>
> this:
>
>> net.ipv4.tcp_mem=16777216 16777216 16777216
>
> ^^ is nonsense.
> tcp_mem is in pages, and has _nothing_ to do with the 16 MiB max
> buffer size you apparently want to set.
>
> I know the first hits on google for tcp_mem sysctl suggest otherwise.
> They are misguided and wrong.
>
> tcp_mem controls the watermarks for _global_ TCP stack memory usage:
> no pressure, moderate pressure, or hard pressure is applied against
> the TCP stack when it tries to allocate more, depending on which
> watermark total usage has crossed.
>
> What you are saying there is that the TCP stack should not be
> throttled at all until its total usage reaches about 16777216 * 4k,
> which is 64 GiB, more than double the total RAM in your box.
>
> DO NOT DO THAT.
>
> If you do, your box will run into hard OOM conditions,
> invoke the oom-killer, and eventually panic.
>
> Set these values to e.g. 10%, 15%, 30% of total RAM,
> and remember the unit is _pages_ (4k).
>
>> net.ipv4.tcp_rmem=4096 87380 16777216
>> net.ipv4.tcp_wmem=4096 65536 16777216
>
> Here, however, I suggest raising the lower mark to about 32k or 64k.
> That is the "guaranteed" amount of buffer space a TCP socket is
> promised to keep, without being throttled further, even under overall
> high memory pressure.
OK, this makes sense. These settings were taken from the OFED 1.4.2
release notes and are actually set dynamically when openibd starts
(via /sbin/ib_ipoib_sysctl)...
We'll start on these, then.
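For reference, if I have the arithmetic right, 24 GiB is about 6291456
pages of 4k, so 10%/15%/30% of RAM would be roughly:

    net.ipv4.tcp_mem=629145 943718 1887436
    net.ipv4.tcp_rmem=65536 87380 16777216
    net.ipv4.tcp_wmem=65536 65536 16777216

(rounded, with the per-socket minimums bumped to 64k as suggested).
We'll override whatever ib_ipoib_sysctl sets with values in that
ballpark.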
>
>> Since we noticed these errors, we've been searching for them on our
>> log servers and found another case where this is occurring with the
>> same DRBD version and InfiniBand. That other case, however, uses a
>> stock Red Hat kernel.
>>
>> Has anyone else seen these errors? Do the tuning recommendations for
>> send/receive buffers need to be revisited? Would switching to rdp vs
>> IPoIB be a potential fix here? Thanks in advance for any insights.
>
> I'm not sure about RDP in this context; I know it as
> http://en.wikipedia.org/wiki/Reliable_Datagram_Protocol,
> which is not available for DRBD.
The context here is my bad typing. I meant SDP.
>
> SDP (as in http://en.wikipedia.org/wiki/Sockets_Direct_Protocol)
> is available with DRBD 8.3.3 (rc*).
> But make sure you use the SDP from OFED 1.4.2, or SDP itself will leak
> full pages.
In our testing, we've been using the SDP from OFED 1.4.2.
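If we move these resources over to SDP, I assume the only drbd.conf
change needed is switching the address lines to the sdp address
family, e.g.

    address sdp 192.168.16.48:7789;

Please correct me if that isn't the right syntax for 8.3.3.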
>
> As SDP support is a new feature in DRBD, I'd like to have feedback on
> how well it works for you, and how performance as well as CPU usage
> compares to IPoIB.
>
> But correcting the tcp_mem setting above
> is more likely to fix your symptoms.
I suspect it will. We'll test and follow up.
Much thanks for the quick reply.
>
> --
> : Lars Ellenberg
> : LINBIT HA-Solutions GmbH
> : DRBD®/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
--
Jason McKay
Sr. Engineer
Logicworks