On Wed, Aug 13, 2008 at 03:00:52PM -0700, Chris Miller wrote:
>
> I've got a pair of HA servers I'm trying to get into production.
> Here are some specs :
>
> Xeon X3210 Quad Core (aka Core 2 Quad) 2.13Ghz (four logical
> processors, no Hyper Threading)
> 4GB memory
> Hardware (3ware) Raid 1 mirror, 2 x Seagate 750GB SATA2
> 650GB DRBD partition run on top of an LVM2 partition.
>
> CentOS 5.2 2.6.18-92.1.6.el5.centos.plus
> DRBD 8.2 (drbd82-8.2.6-1.el5.centos)
> Kernel Module kmod-drbd82-8.2.6-1.2.6.18_92.1.6.el5.centos.plus
>
> I've been trying to rsync data from a remote server and it's crashed
> a couple of times now. It does not happen immediately, but over
> time. I connected a serial console and got the below panic message.
> The last file copied was ~1GB in size, but previous files up to 4GB
> had been copied. I do not have kernel core dumping enabled, but
> that's a possibility if needed. Not sure if this is a bug or is
> caused by something I've done. This isn't my first DRBD install
> (although first on top of LVM) and I believe I've gotten everything
> setup correctly. I did have a full sync rate (110M) enabled over
> Gbe, if that's relevant. Thoughts?
>
> Regards,
> Chris

> [root at haws1 ~]# BUG: unable to handle kernel paging request at
> virtual address c
> printing eip:
> c04e9291
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
> Modules linked in: softdog drbd(U) autofs4 hidp rfcomm l2cap
> bluetooth sunrpc id
> CPU: 0
> EIP: 0060:[<c04e9291>] Tainted: G VLI
> EFLAGS: 00010046 (2.6.18-92.1.6.el5.centos.plus #1)
> EIP is at list_del+0x25/0x5c

In case this Oops can be trusted, this list_del apparently
dereferences a NULL pointer.

> eax: fe187128 ebx: f04a6ab8 ecx: f04a6a8c edx: f04a6a8c
> esi: fe187128 edi: f4e355a0 ebp: f426c800 esp: f385df3c
> ds: 007b es: 007b ss: 0068
> Process drbd0_asender (pid: 2900, ti=f385d000 task=f4932000
> task.ti=f385d000)
> Stack: 000000e6 f8d1953b 00000000 f04a6a8c 000000e6 00000001 ee187b14 00000046
> f49e7bc0 f04a6ab8 f04a6a8c f426c800 fe187128 f4e355a0 0000349f f8d24805
> 00000800 f426c800 f426c800 00000008 f426c9f4 f8d14d47 f385dfbc f8d15fbc
> Call Trace:
> [<f8d1953b>] _req_may_be_done+0x4ea/0x710 [drbd]

And according to this stack trace, it is the
"list_del(&req->tl_requests)" in _req_is_done().

This list is protected by the req_lock spinlock.
"tl" is short for "transfer log", which is the main housekeeping
list structure we have for replication requests, so it is modified
all the time. If we had a list corruption bug there, someone else
should have noticed.

I have no idea how a NULL pointer could get there.
Try to reproduce with a different kernel or with different hardware.

> [<f8d24805>] tl_release+0x35/0x172 [drbd]
> [<f8d14d47>] got_BarrierAck+0x10/0x6b [drbd]
> [<f8d15fbc>] drbd_asender+0x3b1/0x4e7 [drbd]
> [<f8d24a53>] drbd_thread_setup+0x0/0x14e [drbd]
> [<f8d24adb>] drbd_thread_setup+0x88/0x14e [drbd]
> [<f8d24a53>] drbd_thread_setup+0x0/0x14e [drbd]
> [<c0405c3b>] kernel_thread_helper+0x7/0x10
> =======================
> Code: 89 c3 eb eb 90 90 53 89 c3 8b 40 04 8b 00 39 d8 74 17 50 53 68
> 9b 9a 63 c

-- 
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list -- I'm subscribed
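
For readers unfamiliar with the pattern discussed above (a request is unlinked from the transfer log while a lock is held), here is a minimal userspace sketch. It is not DRBD code. The names list_del_checked, tl_add, tl_remove and struct request are made up for illustration, and a C11 atomic_flag stands in for the kernel's req_lock spinlock. The sanity check before unlinking is only an assumption about how one might catch a corrupted or already-removed entry; it does not claim to reproduce what the kernel's list_del does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Circular doubly-linked list node, in the style of the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical stand-in for a replication request linked into the log. */
struct request {
    int id;
    struct list_head tl_requests;
};

/* The "transfer log": an empty circular list points at itself. */
static struct list_head transfer_log = { &transfer_log, &transfer_log };

/* Userspace stand-in for the req_lock spinlock. */
static atomic_flag req_lock = ATOMIC_FLAG_INIT;

static void req_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&req_lock, memory_order_acquire))
        ; /* spin until the flag was clear */
}

static void req_lock_release(void)
{
    atomic_flag_clear_explicit(&req_lock, memory_order_release);
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Unlink an entry, but first verify that both neighbours point back at it.
 * Returns 0 on success, -1 if the entry is unlinked or the list looks corrupt.
 */
static int list_del_checked(struct list_head *e)
{
    if (e->prev == NULL || e->next == NULL)
        return -1;
    if (e->prev->next != e || e->next->prev != e)
        return -1;
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = NULL;
    e->prev = NULL;
    return 0;
}

/* Every modification of the transfer log happens under req_lock. */
static void tl_add(struct request *r)
{
    assert(r != NULL);
    req_lock_acquire();
    list_add_tail(&r->tl_requests, &transfer_log);
    req_lock_release();
}

static int tl_remove(struct request *r)
{
    int rc;

    req_lock_acquire();
    rc = list_del_checked(&r->tl_requests);
    req_lock_release();
    return rc;
}
```

In this sketch, removing the same request twice makes the second tl_remove() return -1 instead of following a stale pointer. A list_del without such a check would dereference whatever prev/next happen to contain, which is the kind of fault the Oops above shows.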