Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Sep 22, 2009, at 4:34 PM, Lars Ellenberg wrote:
> On Tue, Sep 22, 2009 at 01:31:26PM -0400, Jason McKay wrote:
>> Hello all,
>>
>> We're experiencing page allocation errors when writing to a connected
>> drbd device (connected via infiniband using IPoIB) that resulted in a
>> kernel panic last night.
>>
>> When rsyncing data to the primary node, we get a slew of these
>> errors in /var/log/messages:
>>
>> Sep 21 22:11:21 client-nfs5 kernel: drbd0_worker: page allocation
>> failure. order:5, mode:0x10
>> Sep 21 22:11:21 client-nfs5 kernel: Pid: 6114, comm: drbd0_worker
>> Tainted: P 2.6.27.25 #2
>> Sep 21 22:11:21 client-nfs5 kernel:
>> Sep 21 22:11:21 client-nfs5 kernel: Call Trace:
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8026eec4>]
>> __alloc_pages_internal+0x3a4/0x3c0
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c61a>]
>> kmem_getpages+0x6b/0x12b
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028cb93>]
>> fallback_alloc+0x11d/0x1b1
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c816>]
>> kmem_cache_alloc_node+0xa3/0xcf
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803ecf65>]
>> __alloc_skb+0x64/0x12e
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041af6f>]
>> sk_stream_alloc_skb+0x2f/0xd5
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041beb0>]
>> tcp_sendmsg+0x180/0x9d1
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803e6d85>]
>> sock_sendmsg+0xe2/0xff
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>]
>> autoremove_wake_function+0x0/0x2e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>]
>> autoremove_wake_function+0x0/0x2e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8027464b>]
>> zone_statistics+0x3a/0x5d
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff803e82f6>]
>> kernel_sendmsg+0x2c/0x3e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0568d83>] drbd_send
>> +0xb2/0x194 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569134>]
>> _drbd_send_cmd+0x9c/0x116 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569285>]
>> send_bitmap_rle_or_plain+0xd7/0x13a [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569475>]
>> _drbd_send_bitmap+0x18d/0x1ae [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056c242>]
>> drbd_send_bitmap+0x39/0x4c [drbd]
>
> I certainly would not expect to _cause_
> memory pressure from this call path.
> But someone is causing it, and we are affected.
>
>
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b9a1>]
>> w_bitmap_io+0x45/0x95 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa05560c4>]
>> drbd_worker+0x230/0x3eb [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b0d9>]
>> drbd_thread_setup+0x124/0x1ba [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd59>] child_rip
>> +0xa/0x11
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056afb5>]
>> drbd_thread_setup+0x0/0x1ba [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd4f>] child_rip
>> +0x0/0x11
>
> Why is the TCP stack trying to allocate order:5 pages?
>
> That's 32 contiguous pages, 128 KiB of contiguous memory. Apparently
> your memory is so fragmented that this is no longer available.
> But why is the TCP stack not OK with order:0 or order:1 pages?
> Why does it have to be order:5?
>
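For what it's worth, we can watch how fragmented the box gets via
/proc/buddyinfo, which lists the number of free blocks of each order
per zone (columns are order 0 through 10), along the lines of:

    cat /proc/buddyinfo
    Node 0, zone   Normal   1046    527    128     36      0      0 ...

(illustrative output, not captured from this box). If the order-5
column and everything to its right sits at or near zero, 128 KiB
contiguous allocations will fail exactly like the ones above.
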
>> These were occurring for hours until a kernel panic:
>>
>> [Mon Sep 21 22:11:33 2009]INFO: task kjournald:7025 blocked for
>> more than 120 seconds.
>> [Mon Sep 21 22:11:33 2009]"echo 0 > /proc/sys/kernel/
>> hung_task_timeout_secs" disables this message.
>> [Mon Sep 21 22:11:33 2009] ffff88062e801d30 0000000000000046
>> ffff88062e801cf8 ffffffffa056b821
>> [Mon Sep 21 22:11:33 2009] ffff880637d617c0 ffff88063dd05120
>> ffff88063e668750 ffff88063dd05478
>> [Mon Sep 21 22:11:33 2009] 0000000900000002 000000011a942a2d
>> ffffffffffffffff ffffffffffffffff
>> [Mon Sep 21 22:11:33 2009]Call Trace:
>> [Mon Sep 21 22:11:33 2009] [<ffffffffa056b821>] drbd_unplug_fn
>> +0x14a/0x1aa [drbd]
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff804903df>] io_schedule+0x5d/0x9f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b2028>] sync_buffer+0x3b/0x3f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff80490658>] __wait_on_bit
>> +0x40/0x6f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
>> [Mon Sep 21 22:11:34 2009] [<ffffffff804906f3>]
>> out_of_line_wait_on_bit+0x6c/0x78
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d34>] wake_bit_function
>> +0x0/0x23
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0031250>]
>> journal_commit_transaction+0x7cd/0xc4d [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8023dc9b>] lock_timer_base
>> +0x26/0x4b
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033d32>] kjournald
>> +0xc1/0x1fb [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d06>]
>> autoremove_wake_function+0x0/0x2e
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033c71>] kjournald+0x0/0x1fb
>> [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246bd8>] kthread+0x47/0x73
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd59>] child_rip+0xa/0x11
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246b91>] kthread+0x0/0x73
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd4f>] child_rip+0x0/0x11
>> [Mon Sep 21 22:11:34 2009]
>> [Mon Sep 21 22:11:34 2009]Kernel panic - not syncing: softlockup:
>> blocked tasks
>> [-- root at localhost.localdomain attached -- Mon Sep 21 22:17:24 2009]
>> [-- root at localhost.localdomain detached -- Mon Sep 21 23:16:22 2009]
>> [-- Console down -- Tue Sep 22 04:02:15 2009]
>
> That is unfortunate.
>
>> The two systems are hardware and OS identical:
>>
>> [root at client-nfs5 log]# uname -a
>> Linux client-nfs5 2.6.27.25 #2 SMP Fri Jun 26 00:07:23 EDT 2009
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> [root at client-nfs5 log]# grep model\ name /proc/cpuinfo
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>>
>> [root at client-nfs5 log]# free -m
>>              total       used       free     shared    buffers     cached
>> Mem:         24110       2050      22060          0        157       1270
>> -/+ buffers/cache:        622      23488
>> Swap:         4094          0       4094
>>
>> [root at client-nfs5 log]# cat /proc/partitions
>> major minor #blocks name
>>
>> 8 0 1755310080 sda
>> 8 1 1755302031 sda1
>> 8 16 4881116160 sdb
>> 8 17 4881116126 sdb1
>> 8 32 62522712 sdc
>> 8 33 58147236 sdc1
>> 8 34 4192965 sdc2
>> 147 0 1755248424 drbd0
>> 147 1 4880967128 drbd1
>>
>> [root at client-nfs5 log]# modinfo drbd
>> filename: /lib/modules/2.6.27.25/kernel/drivers/block/drbd.ko
>> alias: block-major-147-*
>> license: GPL
>> version: 8.3.3rc1
>
> Just in case, please retry with current git.
> But it is more likely an issue with badly tuned tcp, or tcp via IPoIB.
>
>> description: drbd - Distributed Replicated Block Device v8.3.3rc1
>> author: Philipp Reisner <phil at linbit.com>, Lars Ellenberg <lars at linbit.com>
>> srcversion: A06488A2009E5EB94AF0825
>> depends:
>> vermagic: 2.6.27.25 SMP mod_unload modversions
>> parm: minor_count:Maximum number of drbd devices (1-255)
>> (uint)
>> parm: disable_sendpage:bool
>> parm: allow_oos:DONT USE! (bool)
>> parm: cn_idx:uint
>> parm: proc_details:int
>> parm: enable_faults:int
>> parm: fault_rate:int
>> parm: fault_count:int
>> parm: fault_devs:int
>> parm: usermode_helper:string
>>
>> [root at client-nfs5 log]# cat /etc/drbd.conf
>> global {
>> usage-count no;
>> }
>>
>> common {
>> syncer { rate 800M; } #This is temporary
>
> Please reduce it, and only slowly increase it again, until the cause of
> the high memory pressure is identified.
> Maybe someone is leaking full pages?
> With InfiniBand SDP, that did in fact happen.
>
> Though if there was no resync going on at the time,
> this cannot be the cause.
>
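Understood. We can drop it at runtime rather than editing the config,
something along the lines of (device names here are ours):

    drbdsetup /dev/drbd0 syncer -r 100M
    drbdsetup /dev/drbd1 syncer -r 100M

and revert to whatever drbd.conf says later with "drbdadm adjust drbd0"
(and drbd1), stepping the rate back up while we watch for the
allocation failures.
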
>> protocol C;
>>
>> handlers {
>> }
>>
>> startup { degr-wfc-timeout 60;
>> wfc-timeout 60;
>> }
>>
>> disk {
>> on-io-error detach;
>> fencing dont-care;
>> no-disk-flushes;
>> no-md-flushes;
>> no-disk-barrier;
>> }
>>
>> net {
>> ko-count 2;
>> after-sb-2pri disconnect;
>> max-buffers 16000;
>> max-epoch-size 16000;
>> unplug-watermark 16001;
>> sndbuf-size 8m;
>> rcvbuf-size 8m;
>>
>> }
>> }
>>
>> resource drbd0 {
>> on client-nfs5 {
>> device /dev/drbd0;
>> disk /dev/sda1;
>> address 192.168.16.48:7789;
>> flexible-meta-disk internal;
>> }
>>
>> on client-nfs6 {
>> device /dev/drbd0;
>> disk /dev/sda1;
>> address 192.168.16.49:7789;
>> flexible-meta-disk internal;
>> }
>> }
>>
>> resource drbd1 {
>> on client-nfs5 {
>> device /dev/drbd1;
>> disk /dev/sdb1;
>> address 192.168.16.48:7790;
>> flexible-meta-disk internal;
>> }
>>
>> on client-nfs6 {
>> device /dev/drbd1;
>> disk /dev/sdb1;
>> address 192.168.16.49:7790;
>> flexible-meta-disk internal;
>> }
>> }
>>
>> [root at client-nfs5 log]# ibstat
>> CA 'mlx4_0'
>> CA type: MT26428
>> Number of ports: 2
>> Firmware version: 2.6.100
>> Hardware version: a0
>> Node GUID: 0x0002c9030004a79a
>> System image GUID: 0x0002c9030004a79d
>> Port 1:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
>> Base lid: 1
>> LMC: 0
>> SM lid: 1
>> Capability mask: 0x0251086a
>> Port GUID: 0x0002c9030004a79b
>> Port 2:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
>> Base lid: 1
>> LMC: 0
>> SM lid: 1
>> Capability mask: 0x0251086a
>> Port GUID: 0x0002c9030004a79c
>>
>> [root at client-nfs5 log]# cat /etc/sysctl.conf
>> # Kernel sysctl configuration file for Red Hat Linux
>> #
>> # For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
>> # sysctl.conf(5) for more details.
>>
>> # Controls IP packet forwarding
>> net.ipv4.ip_forward = 0
>>
>> # Controls source route verification
>> net.ipv4.conf.default.rp_filter = 1
>>
>> # Do not accept source routing
>> net.ipv4.conf.default.accept_source_route = 0
>>
>> # Controls the System Request debugging functionality of the kernel
>> kernel.sysrq = 0
>>
>> # Controls whether core dumps will append the PID to the core
>> filename
>> # Useful for debugging multi-threaded applications
>> kernel.core_uses_pid = 1
>>
>> # Controls the use of TCP syncookies
>> net.ipv4.tcp_syncookies = 1
>>
>> # Controls the maximum size of a message, in bytes
>> kernel.msgmnb = 65536
>>
>> # Controls the default maximum size of a message queue
>> kernel.msgmax = 65536
>>
>> # Controls the maximum shared segment size, in bytes
>> kernel.shmmax = 68719476736
>>
>> # Controls the maximum number of shared memory segments, in pages
>> kernel.shmall = 4294967296
>> net.ipv4.tcp_timestamps=0
>> net.ipv4.tcp_sack=0
>> net.core.netdev_max_backlog=250000
>> net.core.rmem_max=16777216
>> net.core.wmem_max=16777216
>> net.core.rmem_default=16777216
>> net.core.wmem_default=16777216
>> net.core.optmem_max=16777216
>
>
> this:
>
>> net.ipv4.tcp_mem=16777216 16777216 16777216
>
> ^^ is nonsense.
> tcp_mem is in pages, and has _nothing_ to do with the 16 MiB max
> buffer size you apparently want to set.
>
> I know the first hits on google for tcp_mem sysctl suggest otherwise.
> They are misguided and wrong.
>
> tcp_mem controls the watermarks for _global_ TCP stack memory usage:
> no pressure, moderate pressure, or hard pressure is applied against
> the TCP stack when it tries to allocate more, depending on which
> watermark total usage has crossed.
>
> What you are saying there is that the TCP stack should not be
> throttled at all until its total usage reaches about 16777216 * 4k,
> which is 64 GiB, more than double the total RAM in your box.
>
> DO NOT DO THAT.
>
> If you do, your box will run into hard OOM conditions,
> invoke the oom-killer, and eventually panic.
>
> Set these values to e.g. 10%, 15%, 30% of total RAM,
> and remember the unit is _pages_ (4k).
>
>> net.ipv4.tcp_rmem=4096 87380 16777216
>> net.ipv4.tcp_wmem=4096 65536 16777216
>
> Here, however, I suggest raising the lower mark to about 32k or 64k.
> That is the "guaranteed" amount of buffer space a TCP socket is
> promised to keep, without being throttled further, even under overall
> high memory pressure.
OK, this makes sense. These settings were taken from the OFED 1.4.2
release notes and are actually set dynamically when openibd starts
(via /sbin/ib_ipoib_sysctl)...
We'll start on these, then.
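For reference, if I have the arithmetic right, 24 GiB is about 6291456
pages of 4k, so 10%/15%/30% of RAM would be roughly:

    net.ipv4.tcp_mem=629145 943718 1887436
    net.ipv4.tcp_rmem=65536 87380 16777216
    net.ipv4.tcp_wmem=65536 65536 16777216

(rounded, with the per-socket minimums bumped to 64k as suggested).
We'll override whatever ib_ipoib_sysctl sets with values in that
ballpark.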
>
>> Since we noticed these errors, we've been searching for them on our
>> log servers and found another case where this is occurring with the
>> same DRBD version and InfiniBand. That other case, however, uses a
>> stock Red Hat kernel.
>>
>> Has anyone else seen these errors? Do the tuning recommendations for
>> send/receive buffers need to be revisited? Would switching to rdp vs
>> IPoIB be a potential fix here? Thanks in advance for any insights.
>
> I'm not sure about RDP in this context; I know it as
> http://en.wikipedia.org/wiki/Reliable_Datagram_Protocol,
> which is not available for DRBD.
The context here is my bad typing. I meant SDP.
>
> SDP (as in http://en.wikipedia.org/wiki/Sockets_Direct_Protocol)
> is available with DRBD 8.3.3 (rc*).
> But make sure you use the SDP from OFED 1.4.2, or SDP itself will leak
> full pages.
In our testing, we've been using the SDP from OFED 1.4.2.
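If we move these resources over to SDP, I assume the only drbd.conf
change needed is switching the address lines to the sdp address
family, e.g.

    address sdp 192.168.16.48:7789;

Please correct me if that isn't the right syntax for 8.3.3.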
>
> As SDP support is a new feature in DRBD, I'd like to have feedback on
> how well it works for you, and how performance as well as CPU usage
> compares to IPoIB.
>
> But correcting the tcp_mem setting above
> is more likely to fix your symptoms.
I suspect it will. We'll test and follow up.
Much thanks for the quick reply.
>
> --
> : Lars Ellenberg
> : LINBIT HA-Solutions GmbH
> : DRBD®/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
--
Jason McKay
Sr. Engineer
Logicworks