[DRBD-user] Page allocation errors and kernel panics with drbd 8.3.3rc1 and infiniband

Lars Ellenberg lars.ellenberg at linbit.com
Mon Oct 5 11:49:11 CEST 2009



On Mon, Oct 05, 2009 at 10:41:23AM +0200, Lars Ellenberg wrote:
> On Sun, Oct 04, 2009 at 10:14:22PM +0200, Lars Ellenberg wrote:
> > On Sun, Oct 04, 2009 at 03:55:44AM -0400, Gennadiy Nerubayev wrote:
> > > On Tue, Sep 22, 2009 at 5:01 PM, Jason McKay <jmckay at logicworks.net> wrote:
> > > 
> > > > On Sep 22, 2009, at 4:34 PM, Lars Ellenberg wrote:
> > > >
> > > > > But correcting the tcp_mem setting above
> > > > > is more likely to fix your symptoms.
> > > >
> > > > I suspect it will.  We'll test and follow up.
> > > >
> > > 
> > > Hi guys,
> > > 
> > > Unfortunately these are still occurring, even after we've updated to rc3,
> > > and used the tuning settings from rc3 notes (prior to this % of memory in
> > > pages were attempted with same results). They are a lot less frequent
> > > (intervals measured in hours), and have not yet caused a panic, but of
> > > course the worry is that it may happen regardless. Anything else that we
> > > could try here to eliminate it completely? Is there any chance that the
> > > ipoib stack is at fault?
> > 
> > Possibly.
> > Maybe Vlad knows more?
> >  From http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/12_01_08.txt:
> > 1419 	maj 	vlad at mellanox 	Iperf-2.0.4 fails: page allocation failure. order:5
> > I guess that means https://bugs.openfabrics.org/show_bug.cgi?id=1419
> > Not much progress on that bug, though.
> 
> This appears related, as well:
> http://bugzilla.kernel.org/show_bug.cgi?id=10890
> 
> Though there it was claimed that leaving network sysctls at the defaults
> "solved" the issue.

And yet one more, where sysctls helped:
http://thread.gmane.org/gmane.linux.nfs/20761/focus=695707

It has a different context, but that thread may give you an idea of
how to track this down further: turn on slab debugging,
then sample /proc/slabinfo, /proc/slab_allocators, /proc/net/sockstat,
and maybe similar statistics in the InfiniBand area.
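
The sampling suggested above could be scripted roughly like this. It is
only a sketch, not something from the referenced thread: the function
names, the 10 second interval and the output format are my assumptions.
/proc/slabinfo usually needs root, and /proc/slab_allocators only exists
on kernels built with slab leak debugging, so unreadable files are skipped.

```python
import time
from pathlib import Path

# Files named in the mail; add InfiniBand counters here if you have them.
PATHS = ["/proc/slabinfo", "/proc/slab_allocators", "/proc/net/sockstat"]


def parse_sockstat(text):
    """Parse /proc/net/sockstat content into {protocol: {field: int}}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        proto, rest = line.split(":", 1)
        tokens = rest.split()
        # Fields are alternating name/value pairs, e.g. "inuse 5 mem 2".
        stats[proto.strip()] = {
            name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])
        }
    return stats


def read_if_present(path):
    """Return the file content, or None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


def sample(paths, interval_s=10, count=6, out=print):
    """Dump each readable file with a timestamp, `count` times (hypothetical helper)."""
    for i in range(count):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for path in paths:
            text = read_if_present(path)
            if text is None:
                out(f"### {stamp} {path}: not readable, skipped")
                continue
            out(f"### {stamp} {path}")
            out(text.rstrip())
        if i < count - 1:
            time.sleep(interval_s)
```

Running sample(PATHS) as root and redirecting the output to a file gives
snapshots to compare around the time of an allocation failure;
parse_sockstat() is there in case you only want to track, say, the TCP
"mem" counter over time.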

BTW, maybe your netdev_max_backlog is a bit excessive?
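
To see what is actually in effect, something like the following would do.
Again just a sketch with helper names I made up; it reads the values from
/proc/sys and converts tcp_mem, which is counted in pages, to bytes.

```python
import os
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return the raw value of a dotted sysctl name, or None if unreadable."""
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def tcp_mem_bytes(raw, page_size):
    """Convert the three tcp_mem thresholds (low, pressure, high) from pages to bytes."""
    low, pressure, high = (int(v) * page_size for v in raw.split())
    return low, pressure, high


def report(out=print):
    """Print the two settings discussed in this thread (hypothetical helper)."""
    backlog = read_sysctl("net.core.netdev_max_backlog")
    out(f"net.core.netdev_max_backlog = {backlog}")
    raw = read_sysctl("net.ipv4.tcp_mem")
    if raw is not None:
        page_size = os.sysconf("SC_PAGE_SIZE")
        out(f"net.ipv4.tcp_mem = {raw} pages = {tcp_mem_bytes(raw, page_size)} bytes")
```

Calling report() on both nodes makes it easy to compare the current
values against what you intended to set.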


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


