Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Tue, Sep 22, 2009 at 01:31:26PM -0400, Jason McKay wrote:
> Hello all,
>
> We're experiencing page allocation errors when writing to a connected
> drbd device (connected via infiniband using IPoIB) that resulted in a
> kernel panic last night.
>
> When rsyncing data to the primary node, we get a slew of these errors
> in /var/log/messages:
>
> Sep 21 22:11:21 client-nfs5 kernel: drbd0_worker: page allocation failure. order:5, mode:0x10
> Sep 21 22:11:21 client-nfs5 kernel: Pid: 6114, comm: drbd0_worker Tainted: P 2.6.27.25 #2
> Sep 21 22:11:21 client-nfs5 kernel:
> Sep 21 22:11:21 client-nfs5 kernel: Call Trace:
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8026eec4>] __alloc_pages_internal+0x3a4/0x3c0
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c61a>] kmem_getpages+0x6b/0x12b
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028cb93>] fallback_alloc+0x11d/0x1b1
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c816>] kmem_cache_alloc_node+0xa3/0xcf
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803ecf65>] __alloc_skb+0x64/0x12e
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041af6f>] sk_stream_alloc_skb+0x2f/0xd5
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041beb0>] tcp_sendmsg+0x180/0x9d1
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803e6d85>] sock_sendmsg+0xe2/0xff
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>] autoremove_wake_function+0x0/0x2e
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>] autoremove_wake_function+0x0/0x2e
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8027464b>] zone_statistics+0x3a/0x5d
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff803e82f6>] kernel_sendmsg+0x2c/0x3e
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0568d83>] drbd_send+0xb2/0x194 [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569134>] _drbd_send_cmd+0x9c/0x116 [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569285>] send_bitmap_rle_or_plain+0xd7/0x13a [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569475>] _drbd_send_bitmap+0x18d/0x1ae [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056c242>] drbd_send_bitmap+0x39/0x4c [drbd]

I certainly would not expect to _cause_ memory pressure from this call
path. But someone is causing it, and we are affected.

> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b9a1>] w_bitmap_io+0x45/0x95 [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa05560c4>] drbd_worker+0x230/0x3eb [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b0d9>] drbd_thread_setup+0x124/0x1ba [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd59>] child_rip+0xa/0x11
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056afb5>] drbd_thread_setup+0x0/0x1ba [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd4f>] child_rip+0x0/0x11

Why is the TCP stack trying to allocate order:5 pages? That is 32
contiguous pages, 128 KiB of contiguous memory. Apparently your memory
is so badly fragmented that this is no longer available. But why is the
TCP stack not content with order:0 or order:1 pages? Why does it have
to be order:5?
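A quick way to see how fragmented memory actually is (assuming you can
still get a shell while this is going on) is /proc/buddyinfo, which
lists, per zone, the number of free blocks of each order from 0 to 10:

    # columns are free block counts for order 0, 1, 2, ... 10
    cat /proc/buddyinfo

If the order-5-and-higher columns for the Normal zones are at or near
zero, there simply is no free 128 KiB contiguous chunk left to hand
out, even though "free -m" may still show plenty of free memory
overall.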
> These were occurring for hours until a kernel panic:
>
> [Mon Sep 21 22:11:33 2009]INFO: task kjournald:7025 blocked for more than 120 seconds.
> [Mon Sep 21 22:11:33 2009]"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Mon Sep 21 22:11:33 2009] ffff88062e801d30 0000000000000046 ffff88062e801cf8 ffffffffa056b821
> [Mon Sep 21 22:11:33 2009] ffff880637d617c0 ffff88063dd05120 ffff88063e668750 ffff88063dd05478
> [Mon Sep 21 22:11:33 2009] 0000000900000002 000000011a942a2d ffffffffffffffff ffffffffffffffff
> [Mon Sep 21 22:11:33 2009]Call Trace:
> [Mon Sep 21 22:11:33 2009] [<ffffffffa056b821>] drbd_unplug_fn+0x14a/0x1aa [drbd]
> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
> [Mon Sep 21 22:11:33 2009] [<ffffffff804903df>] io_schedule+0x5d/0x9f
> [Mon Sep 21 22:11:33 2009] [<ffffffff802b2028>] sync_buffer+0x3b/0x3f
> [Mon Sep 21 22:11:33 2009] [<ffffffff80490658>] __wait_on_bit+0x40/0x6f
> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
> [Mon Sep 21 22:11:34 2009] [<ffffffff804906f3>] out_of_line_wait_on_bit+0x6c/0x78
> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d34>] wake_bit_function+0x0/0x23
> [Mon Sep 21 22:11:34 2009] [<ffffffffa0031250>] journal_commit_transaction+0x7cd/0xc4d [jbd]
> [Mon Sep 21 22:11:34 2009] [<ffffffff8023dc9b>] lock_timer_base+0x26/0x4b
> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033d32>] kjournald+0xc1/0x1fb [jbd]
> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d06>] autoremove_wake_function+0x0/0x2e
> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033c71>] kjournald+0x0/0x1fb [jbd]
> [Mon Sep 21 22:11:34 2009] [<ffffffff80246bd8>] kthread+0x47/0x73
> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd59>] child_rip+0xa/0x11
> [Mon Sep 21 22:11:34 2009] [<ffffffff80246b91>] kthread+0x0/0x73
> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd4f>] child_rip+0x0/0x11
> [Mon Sep 21 22:11:34 2009]
> [Mon Sep 21 22:11:34 2009]Kernel panic - not syncing: softlockup: blocked tasks
> [-- root at localhost.localdomain attached -- Mon Sep 21 22:17:24 2009]
> [-- root at localhost.localdomain detached -- Mon Sep 21 23:16:22 2009]
> [-- Console down -- Tue Sep 22 04:02:15 2009]

That is unfortunate.
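The panic itself means this box is configured to panic when it detects
blocked tasks, not just to warn about them. Assuming this kernel
exposes the usual knob next to hung_task_timeout_secs (which the log
message above already refers to), you could check and, while debugging,
relax that:

    # 1 = panic when a task is blocked longer than hung_task_timeout_secs
    cat /proc/sys/kernel/hung_task_panic
    # keep the warnings, but let the box survive long enough to inspect it
    echo 0 > /proc/sys/kernel/hung_task_panic

That only treats a symptom, of course; the blocked kjournald is a
consequence of the allocation failures above.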
> The two systems are hardware and OS identical:
>
> [root at client-nfs5 log]# uname -a
> Linux client-nfs5 2.6.27.25 #2 SMP Fri Jun 26 00:07:23 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>
> [root at client-nfs5 log]# grep model\ name /proc/cpuinfo
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>
> [root at client-nfs5 log]# free -m
>              total       used       free     shared    buffers     cached
> Mem:         24110       2050      22060          0        157       1270
> -/+ buffers/cache:        622      23488
> Swap:         4094          0       4094
>
> [root at client-nfs5 log]# cat /proc/partitions
> major minor  #blocks  name
>
>    8     0 1755310080 sda
>    8     1 1755302031 sda1
>    8    16 4881116160 sdb
>    8    17 4881116126 sdb1
>    8    32   62522712 sdc
>    8    33   58147236 sdc1
>    8    34    4192965 sdc2
>  147     0 1755248424 drbd0
>  147     1 4880967128 drbd1
>
> [root at client-nfs5 log]# modinfo drbd
> filename:       /lib/modules/2.6.27.25/kernel/drivers/block/drbd.ko
> alias:          block-major-147-*
> license:        GPL
> version:        8.3.3rc1

Just in case, please retry with current git. But this is more likely an
issue with badly tuned TCP, or with TCP over IPoIB.

> description:    drbd - Distributed Replicated Block Device v8.3.3rc1
> author:         Philipp Reisner <phil at linbit.com>, Lars Ellenberg <lars at linbit.com>
> srcversion:     A06488A2009E5EB94AF0825
> depends:
> vermagic:       2.6.27.25 SMP mod_unload modversions
> parm:           minor_count:Maximum number of drbd devices (1-255) (uint)
> parm:           disable_sendpage:bool
> parm:           allow_oos:DONT USE! (bool)
> parm:           cn_idx:uint
> parm:           proc_details:int
> parm:           enable_faults:int
> parm:           fault_rate:int
> parm:           fault_count:int
> parm:           fault_devs:int
> parm:           usermode_helper:string
>
> [root at client-nfs5 log]# cat /etc/drbd.conf
> global {
>     usage-count no;
> }
>
> common {
>     syncer { rate 800M; } #This is temporary

Please reduce that, and only slowly increase it again, until the cause
of the high memory pressure is identified. Maybe someone is leaking
full pages? With infiniband SDP, that in fact did happen. Though, if
there was no resync running at the time, this cannot be the cause.
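For example, a minimal sketch of turning the resync rate down at
runtime, without editing drbd.conf (the 100M figure is only
illustrative; pick something well below what your hardware sustains):

    # temporarily limit the resync rate for this device
    drbdsetup /dev/drbd0 syncer -r 100M
    # revert to whatever drbd.conf says once you are done experimenting
    drbdadm adjust drbd0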
>     protocol C;
>
>     handlers {
>     }
>
>     startup {
>         degr-wfc-timeout 60;
>         wfc-timeout 60;
>     }
>
>     disk {
>         on-io-error detach;
>         fencing dont-care;
>         no-disk-flushes;
>         no-md-flushes;
>         no-disk-barrier;
>     }
>
>     net {
>         ko-count 2;
>         after-sb-2pri disconnect;
>         max-buffers 16000;
>         max-epoch-size 16000;
>         unplug-watermark 16001;
>         sndbuf-size 8m;
>         rcvbuf-size 8m;
>     }
> }
>
> resource drbd0 {
>     on client-nfs5 {
>         device /dev/drbd0;
>         disk /dev/sda1;
>         address 192.168.16.48:7789;
>         flexible-meta-disk internal;
>     }
>
>     on client-nfs6 {
>         device /dev/drbd0;
>         disk /dev/sda1;
>         address 192.168.16.49:7789;
>         flexible-meta-disk internal;
>     }
> }
>
> resource drbd1 {
>     on client-nfs5 {
>         device /dev/drbd1;
>         disk /dev/sdb1;
>         address 192.168.16.48:7790;
>         flexible-meta-disk internal;
>     }
>
>     on client-nfs6 {
>         device /dev/drbd1;
>         disk /dev/sdb1;
>         address 192.168.16.49:7790;
>         flexible-meta-disk internal;
>     }
> }
>
> [root at client-nfs5 log]# ibstat
> CA 'mlx4_0'
>         CA type: MT26428
>         Number of ports: 2
>         Firmware version: 2.6.100
>         Hardware version: a0
>         Node GUID: 0x0002c9030004a79a
>         System image GUID: 0x0002c9030004a79d
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 1
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x0251086a
>                 Port GUID: 0x0002c9030004a79b
>         Port 2:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 1
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x0251086a
>                 Port GUID: 0x0002c9030004a79c
>
> [root at client-nfs5 log]# cat /etc/sysctl.conf
> # Kernel sysctl configuration file for Red Hat Linux
> #
> # For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
> # sysctl.conf(5) for more details.
>
> # Controls IP packet forwarding
> net.ipv4.ip_forward = 0
>
> # Controls source route verification
> net.ipv4.conf.default.rp_filter = 1
>
> # Do not accept source routing
> net.ipv4.conf.default.accept_source_route = 0
>
> # Controls the System Request debugging functionality of the kernel
> kernel.sysrq = 0
>
> # Controls whether core dumps will append the PID to the core filename
> # Useful for debugging multi-threaded applications
> kernel.core_uses_pid = 1
>
> # Controls the use of TCP syncookies
> net.ipv4.tcp_syncookies = 1
>
> # Controls the maximum size of a message, in bytes
> kernel.msgmnb = 65536
>
> # Controls the default maxmimum size of a mesage queue
> kernel.msgmax = 65536
>
> # Controls the maximum shared segment size, in bytes
> kernel.shmmax = 68719476736
>
> # Controls the maximum number of shared memory segments, in pages
> kernel.shmall = 4294967296
> net.ipv4.tcp_timestamps=0
> net.ipv4.tcp_sack=0
> net.core.netdev_max_backlog=250000
> net.core.rmem_max=16777216
> net.core.wmem_max=16777216
> net.core.rmem_default=16777216
> net.core.wmem_default=16777216
> net.core.optmem_max=16777216

This:

> net.ipv4.tcp_mem=16777216 16777216 16777216

^^ is nonsense. tcp_mem is in pages, and has _nothing_ to do with the
16 MiB maximum buffer size you apparently want to set. I know the first
hits on Google for the tcp_mem sysctl suggest otherwise; they are
misguided and wrong. tcp_mem controls the watermarks for _global_ TCP
stack memory usage: below which no pressure, moderate pressure, or high
pressure is applied to the TCP stack when it tries to allocate more.

What you are saying there is that you do not want to throttle TCP stack
memory usage at all until its total usage reaches about 16777216 * 4k,
which is 64 GiB -- more than double the total RAM in your box.

DO NOT DO THAT. Or your box will run into hard OOM conditions, invoke
the oom-killer, and eventually panic. Set these values to e.g. 10%,
15%, 30% of total RAM, and remember the unit is _pages_ (4k).

> net.ipv4.tcp_rmem=4096 87380 16777216
> net.ipv4.tcp_wmem=4096 65536 16777216

Here, however, I suggest raising the lower mark to about 32k or 64k.
That is the amount of buffer space a TCP socket is "guaranteed" to be
left with, without being throttled further, even under overall high
memory pressure.
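As a rough sketch for this box's 24 GiB of RAM (roughly 6.2 million
4 KiB pages; the exact numbers are illustrative, round them as you
like), the corrected settings could look like this:

    # tcp_mem is in 4 KiB pages: about 10% / 15% / 30% of 24 GiB
    net.ipv4.tcp_mem = 617000 925000 1851000
    # per-socket buffers stay in bytes; raise the guaranteed minimum to 64 KiB
    net.ipv4.tcp_rmem = 65536 87380 16777216
    net.ipv4.tcp_wmem = 65536 65536 16777216

Apply with "sysctl -p", then watch the "mem" field (also in pages) of
the TCP line in /proc/net/sockstat to see how much memory the TCP stack
is actually using.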
> Since we noticed these errors, we've been searching for them on our
> logservers and found another case where this is occurring using the
> same drbd version and infiniband. This other case, however, uses a
> stock redhat kernel.
>
> Has anyone else seen these errors? Do the tuning recommendations for
> send/receive buffers need to be revisited? Would switching to rdp vs
> IPoIB be a potential fix here? Thanks in advance for any insights.

I'm not sure about RDP in this context; I know it as
http://en.wikipedia.org/wiki/Reliable_Datagram_Protocol,
which is not available for DRBD.

SDP (as in http://en.wikipedia.org/wiki/Sockets_Direct_Protocol) is
available with DRBD 8.3.3 (rc*). But make sure you use the SDP from
OFED 1.4.2, or SDP itself will leak full pages.

As SDP support is a new feature in DRBD, I'd like to have feedback on
how well it works for you, and how performance as well as CPU usage
compare to IPoIB.

But correcting the tcp_mem setting above is more likely to fix your
symptoms.

--
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed