Note: "permalinks" may not be as permanent as we would like;
direct links from old sources may well be a few messages off.
> On Mon, Feb 09, 2009 at 12:28:18PM +0100, Ronald Moesbergen wrote:
> > Hello!
> >
> > I'm using DRBD 8.0.14 on a Xen 3.3.1 x86_64 cluster for disk
> > replication. Over the last few weeks the systems have been crashing,
> > particularly under load. I have captured the following using
> > netconsole:
> >
> > Unable to handle kernel paging request at 0000180017006702 RIP:
> > [<ffffffff80262ceb>] put_page+0x0/0x2e
> > PGD 0
> > Oops: 0000 [1] SMP
> > CPU 3
> > Modules linked in: netconsole ip_vs_wrr ip_vs xt_physdev
> > iptable_filter ip_tables x_tables drbd bridge button ac battery
> > ipmi_devintf ipmi_si ipmi_msghandler e1000e serio_raw pcsp$
> > Pid: 4044, comm: drbd1_receiver Tainted: GF 2.6.18.8-xen #1
> > RIP: e030:[<ffffffff80262ceb>] [<ffffffff80262ceb>] put_page+0x0/0x2e
> > RSP: e02b:ffff88026548bba8 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: ffff880192b24880 RCX: 0000000000000027
> > RDX: ffff88026e63d680 RSI: ffff8801b1fe8380 RDI: 0000180017006702
> > RBP: 0000000000000001 R08: 010100004600f501 R09: 0000000000000018
> > R10: ffffffff8049dd80 R11: ffff8802672ba8f8 R12: 0000000000000018
> > R13: 0000000000000018 R14: 0000000000000000 R15: 0000000000000000
> > FS: 00002ac6a4f7e010(0000) GS:ffffffff804d8180(0000) knlGS:0000000000000000
> > CS: e033 DS: 0000 ES: 0000
> > Process drbd1_receiver (pid: 4044, threadinfo ffff88026548a000, task ffff880265a017a0)
> > Stack: ffffffff80395b3c ffff880192b24880 ffff880192b24880 ffff880192b24880
> > ffffffff80395911 ffff88016fe34d80 ffffffff803c35e0 0000410000000008
> > ffff88026548be40 0000001800000000 ffff88016fe35280 0000001800000000
> > Call Trace:
> > [<ffffffff80395b3c>] skb_release_data+0x61/0x9c
> > [<ffffffff80395911>] kfree_skbmem+0x9/0x75
> > [<ffffffff803c35e0>] tcp_recvmsg+0x72e/0xb05
> > [<ffffffff80392091>] sock_common_recvmsg+0x2d/0x42
> > [<ffffffff80392091>] sock_common_recvmsg+0x2d/0x42
> > [<ffffffff8038fee9>] sock_recvmsg+0x101/0x120
> > [<ffffffff8038fee9>] sock_recvmsg+0x101/0x120
> > [<ffffffff8024327a>] autoremove_wake_function+0x0/0x2e
> > [<ffffffff802dd2d5>] generic_make_request+0x15f/0x174
> > [<ffffffff880c52bf>] :dm_mod:__map_bio+0x47/0x9b
> > [<ffffffff8817cb26>] :drbd:drbd_recv+0x7b/0x109
> > [<ffffffff88180cd0>] :drbd:receive_DataRequest+0x72/0x575
> > [<ffffffff8817d4da>] :drbd:drbdd+0x77/0x151
> > [<ffffffff881801f2>] :drbd:drbdd_init+0xbe/0x1ab
> > [<ffffffff88190440>] :drbd:drbd_thread_setup+0x11c/0x1c6
> > [<ffffffff8020af54>] child_rip+0xa/0x12
> > [<ffffffff88190324>] :drbd:drbd_thread_setup+0x0/0x1c6
> > [<ffffffff8020af4a>] child_rip+0x0/0x12
> >
> > Code: 8b 07 f6 c4 40 74 05 e9 62 f9 ff ff 8b 47 08 85 c0 75 0a 0f
> > RIP [<ffffffff80262ceb>] put_page+0x0/0x2e RSP <ffff88026548bba8>
> > CR2: 0000180017006702
> >
> > Since there are a lot of drbd related functions listed, I suspect the
> > problem originates there. The logging shows nothing interesting around
> > the time of the crash, just a spontaneous crash/reboot. The
> > 'receiver1' thread belongs to a 'zabbix' DomU, which runs a mysql
> > database that causes most of the io load.
> >
> > The exact DRBD version I use is:
> > version: 8.0.14 (api:86/proto:86)
> > GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by
> > beheer at xen03, 2008-11-20 11:25:39
> > And the drbd.conf file is attached.
> >
> > As you can see the kernel tainted flags say 'GF'. I have searched
> > where this comes from, but can't find anything. We only run
> > open-source stuff on these machines, so no scary binary only modules.
> > Also, a dmesg|grep -i 'tainted' returns nothing, and cat
> > /proc/sys/kernel/tainted returns '0'. Go figure... My guess is that
> > some startup script uses 'insmod -f' while it's not needed.
>
> then, how about grepping for that in those startup scripts?
> or in the initramdisk/initramfs?

I did but couldn't find anything. Also, after booting the machine and
starting everything up, when I do a 'cat /proc/sys/kernel/tainted' it
returns '0'.

> > The kernel I'm using is here:
> > http://xenbits.xensource.com/linux-2.6.18-xen.hg. It's quite old, and
> > I'd love to upgrade to something recent, but this is the only kernel
> > that can run as Dom0 with a recent version of Xen that I can find.
> >
> > My questions are:
> > - Is this related to DRBD
>
> maybe "related".
> if so, then probably only in a "garbage in, garbage out" sort of way.
>
> > or should I go and bug the xen guys?
>
> yes please,
> and let us know how that goes.

Ok, I will try that then.

> > - Would upgrading to 8.2.x or even 8.3.x help?
>
> as I don't think DRBD does anything wrong in this case, I
> don't think that upgrading will resolve this issue.
>
> the tcp stack, in tcp_recvmsg, thinks it should clean up a
> bit of the skb memory, calls kfree_skbmem, skb_release_data,
> which then finds some pages it no longer needs, calls
> put_page on those. at which point dereferencing the page
> pointer apparently causes the oops.
>
> drbd does not touch those page pointers, it only passes them on.
>
> and if it was an invalid page pointer when passing it to the
> tcp stack, it would have exploded right then and there.
> so it was a valid page pointer once, but is invalid now.
>
> from which I conclude that something causes memory corruption
> on your box.
>
> a kernel module that was forced to be loaded, despite not
> matching the kernel, might do such thing.
> as would a number of other things, including bad hardware.

Interesting. Since both boxes crash now and then, I think we can rule
out bad hardware. It must be the ancient xen kernel that bugs out on me
again.

Thanks for your help!

> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
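
For anyone else puzzling over the 'GF' flags versus the '0' read from
/proc/sys/kernel/tainted: below is a minimal sketch of how the numeric
taint value maps to the letters printed in an oops. It assumes the
six-flag layout of 2.6.18-era kernels (bit 0 proprietary module, bit 1
forced module load, bit 2 SMP-unsafe, bit 3 forced module unload, bit 4
machine check, bit 5 bad page). The name decode_taint is hypothetical,
and newer kernels define more flags, so treat it as an illustration
rather than a reference.

```python
# Hypothetical helper: map a /proc/sys/kernel/tainted bitmask to oops letters.
# Assumption: the six taint bits of 2.6.18-era kernels, in print order.
TAINT_FLAGS = [
    (1 << 0, "P", "G"),  # proprietary module loaded; 'G' is printed when clear
    (1 << 1, "F", " "),  # a module was force-loaded (e.g. insmod -f)
    (1 << 2, "S", " "),  # SMP-unsafe condition
    (1 << 3, "R", " "),  # a module was force-unloaded
    (1 << 4, "M", " "),  # machine check exception occurred
    (1 << 5, "B", " "),  # bad page state was detected
]


def decode_taint(value: int) -> str:
    """Return the taint string an oops would show for this bitmask."""
    if value == 0:
        return "Not tainted"
    letters = "".join(on if value & bit else off for bit, on, off in TAINT_FLAGS)
    # Trailing blanks carry no information, so drop them for readability.
    return "Tainted: " + letters.rstrip()


if __name__ == "__main__":
    for value in (0, 2):
        print(value, "->", decode_taint(value))
```

Under this layout the 'G' only says that no proprietary module is
loaded; the 'F' is the actual taint, and 'GF' corresponds to a value of
2. A value of 0, as read after boot, would print as "Not tainted".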