Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Mon, Feb 09, 2009 at 12:28:18PM +0100, Ronald Moesbergen wrote:
> Hello!
>
> I'm using DRBD 8.0.14 on a Xen 3.3.1 x86_64 cluster for disk
> replication. Over the last few weeks the systems have been crashing,
> particularly under load. I have captured the following using
> netconsole:
>
> Unable to handle kernel paging request at 0000180017006702 RIP:
> [<ffffffff80262ceb>] put_page+0x0/0x2e
> PGD 0
> Oops: 0000 [1] SMP
> CPU 3
> Modules linked in: netconsole ip_vs_wrr ip_vs xt_physdev
> iptable_filter ip_tables x_tables drbd bridge button ac battery
> ipmi_devintf ipmi_si ipmi_msghandler e1000e serio_raw pcsp$
> Pid: 4044, comm: drbd1_receiver Tainted: GF 2.6.18.8-xen #1
> RIP: e030:[<ffffffff80262ceb>] [<ffffffff80262ceb>] put_page+0x0/0x2e
> RSP: e02b:ffff88026548bba8 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff880192b24880 RCX: 0000000000000027
> RDX: ffff88026e63d680 RSI: ffff8801b1fe8380 RDI: 0000180017006702
> RBP: 0000000000000001 R08: 010100004600f501 R09: 0000000000000018
> R10: ffffffff8049dd80 R11: ffff8802672ba8f8 R12: 0000000000000018
> R13: 0000000000000018 R14: 0000000000000000 R15: 0000000000000000
> FS: 00002ac6a4f7e010(0000) GS:ffffffff804d8180(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000
> Process drbd1_receiver (pid: 4044, threadinfo ffff88026548a000, task ffff880265a017a0)
> Stack: ffffffff80395b3c ffff880192b24880 ffff880192b24880 ffff880192b24880
> ffffffff80395911 ffff88016fe34d80 ffffffff803c35e0 0000410000000008
> ffff88026548be40 0000001800000000 ffff88016fe35280 0000001800000000
> Call Trace:
> [<ffffffff80395b3c>] skb_release_data+0x61/0x9c
> [<ffffffff80395911>] kfree_skbmem+0x9/0x75
> [<ffffffff803c35e0>] tcp_recvmsg+0x72e/0xb05
> [<ffffffff80392091>] sock_common_recvmsg+0x2d/0x42
> [<ffffffff80392091>] sock_common_recvmsg+0x2d/0x42
> [<ffffffff8038fee9>] sock_recvmsg+0x101/0x120
> [<ffffffff8038fee9>] sock_recvmsg+0x101/0x120
> [<ffffffff8024327a>] autoremove_wake_function+0x0/0x2e
> [<ffffffff802dd2d5>] generic_make_request+0x15f/0x174
> [<ffffffff880c52bf>] :dm_mod:__map_bio+0x47/0x9b
> [<ffffffff8817cb26>] :drbd:drbd_recv+0x7b/0x109
> [<ffffffff88180cd0>] :drbd:receive_DataRequest+0x72/0x575
> [<ffffffff8817d4da>] :drbd:drbdd+0x77/0x151
> [<ffffffff881801f2>] :drbd:drbdd_init+0xbe/0x1ab
> [<ffffffff88190440>] :drbd:drbd_thread_setup+0x11c/0x1c6
> [<ffffffff8020af54>] child_rip+0xa/0x12
> [<ffffffff88190324>] :drbd:drbd_thread_setup+0x0/0x1c6
> [<ffffffff8020af4a>] child_rip+0x0/0x12
>
> Code: 8b 07 f6 c4 40 74 05 e9 62 f9 ff ff 8b 47 08 85 c0 75 0a 0f
> RIP [<ffffffff80262ceb>] put_page+0x0/0x2e
> RSP <ffff88026548bba8>
> CR2: 0000180017006702
>
> Since there are a lot of drbd related functions listed, I suspect the
> problem originates there. The logging shows nothing interesting around
> the time of the crash, just a spontaneous crash/reboot. The
> 'receiver1' thread belongs to a 'zabbix' DomU, which runs a mysql
> database that causes most of the io load.
>
> The exact DRBD version I use is:
> version: 8.0.14 (api:86/proto:86)
> GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by beheer at xen03, 2008-11-20 11:25:39
> And the drbd.conf file is attached.
>
> As you can see the kernel tainted flags say 'GF'. I have searched
> where this comes from, but can't find anything. We only run
> open-source stuff on these machines, so no scary binary only modules.
> Also, a dmesg|grep -i 'tainted' returns nothing, and cat
> /proc/sys/kernel/tainted returns '0'. Go figure... My guess is that
> some startup script uses 'insmod -f' while it's not needed.

then, how about grepping for that in those startup scripts?
or in the initramdisk/initramfs?

> The kernel I'm using is here:
> http://xenbits.xensource.com/linux-2.6.18-xen.hg. It's quite old, and
> I'd love to upgrade to something recent, but this is the only kernel
> that can run as Dom0 with a recent version of Xen that I can find.
>
> My questions are:
> - Is this related to DRBD

maybe "related".
if so, then probably only in a "garbage in, garbage out" sort of way.

> or should I go and bug the xen guys?

yes please, and let us know how that goes.

> - Would upgrading to 8.2.x or even 8.3.x help?

as I don't think DRBD does anything wrong in this case,
I don't think that upgrading will resolve this issue.

the tcp stack, in tcp_recvmsg, thinks it should clean up a bit of the
skb memory, calls kfree_skbmem, skb_release_data, which then finds
some pages it no longer needs, calls put_page on those.
at which point dereferencing the page pointer apparently causes the oops.

drbd does not touch those page pointers, it only passes them on.
and if it was an invalid page pointer when passing it to the tcp stack,
it would have exploded right then and there.

so it was a valid page pointer once, but is invalid now.
from which I conclude that something causes memory corruption on your box.

a kernel module that was forced to be loaded, despite not matching the
kernel, might do such a thing.
as would a number of other things, including bad hardware.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
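
For readers following up on the 'GF' taint flags discussed in this thread, here is a minimal Python sketch that maps the letters printed after "Tainted:" to a bitmask and back to a description. It assumes the bit layout used by 2.6.18-era kernels (newer kernels define more bits), and the names TAINT_BITS, letters_to_mask and describe_mask are made up for this illustration; they are not part of any kernel or DRBD tool.

```python
# Taint bits as printed by 2.6.18-era kernels (assumed layout; newer
# kernels define additional bits and letters).
TAINT_BITS = [
    (1 << 0, "P", "proprietary module was loaded"),
    (1 << 1, "F", "module was force-loaded, e.g. with insmod -f"),
    (1 << 2, "S", "SMP kernel on hardware not rated SMP-safe"),
    (1 << 3, "R", "module was force-unloaded"),
    (1 << 4, "M", "machine check exception occurred"),
    (1 << 5, "B", "bad page state was detected"),
]


def letters_to_mask(letters):
    """Turn the letters after 'Tainted:' (e.g. 'GF') into a bitmask."""
    by_letter = {letter: bit for bit, letter, _ in TAINT_BITS}
    mask = 0
    for ch in letters:
        if ch in ("G", " "):  # 'G' and blanks mean "bit not set"
            continue
        if ch not in by_letter:
            raise ValueError("unknown taint letter: %r" % ch)
        mask |= by_letter[ch]
    return mask


def describe_mask(mask):
    """List (letter, meaning) for every taint bit set in mask."""
    return [(letter, why) for bit, letter, why in TAINT_BITS if mask & bit]


if __name__ == "__main__":
    mask = letters_to_mask("GF")  # flags from the oops in this thread
    print("mask:", mask)
    for letter, why in describe_mask(mask):
        print(letter, "-", why)
```

Under these assumptions, 'GF' means "no proprietary module, but some module was force-loaded", which fits the suggestion to grep the startup scripts and the initramfs for 'insmod -f'. The same describe_mask helper could be applied to the number read from /proc/sys/kernel/tainted; note that this value only describes the currently running boot.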