[DRBD-user] DRBD 8.2 crashes CentOS 5.2 on rsync from remote host

Sun Aug 17 18:51:48 CEST 2008

On Wed, Aug 13, 2008 at 03:00:52PM -0700, Chris Miller wrote:
>
> I've got a pair of HA servers I'm trying to get into production.
> Here are some specs :
>
> Xeon X3210 Quad Core (aka Core 2 Quad) 2.13GHz (four logical
> processors, no Hyper-Threading)
> 4GB memory
> Hardware (3ware) RAID 1 mirror, 2 x Seagate 750GB SATA2
> 650GB DRBD partition run on top of an LVM2 partition.
>
> CentOS 5.2 2.6.18-92.1.6.el5.centos.plus
> DRBD 8.2 (drbd82-8.2.6-1.el5.centos)
> Kernel Module kmod-drbd82-8.2.6-1.2.6.18_92.1.6.el5.centos.plus
>
> I've been trying to rsync data from a remote server and it's crashed
> a couple of times now. It does not happen immediately, but over
> time. I connected a serial console and got the below panic message.
> The last file copied was ~1GB in size, but previous files up to 4GB
> had been copied. I do not have kernel core dumping enabled, but
> that's a possibility if needed. Not sure if this is a bug or is
> caused by something I've done. This isn't my first DRBD install
> (although first on top of LVM) and I believe I've gotten everything
> set up correctly. I did have a full sync rate (110M) enabled over
> GbE, if that's relevant. Thoughts?
>
> Regards,
> 	Chris

> [root@haws1 ~]# BUG: unable to handle kernel paging request at
> virtual address c
>  printing eip:
> c04e9291
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
> Modules linked in: softdog drbd(U) autofs4 hidp rfcomm l2cap
> bluetooth sunrpc id
> CPU:    0
> EIP:    0060:[<c04e9291>]    Tainted: G      VLI
> EFLAGS: 00010046   (2.6.18-92.1.6.el5.centos.plus #1)

> EIP is at list_del+0x25/0x5c 

if this Oops can be trusted,
this list_del apparently dereferences a NULL pointer.

> eax: fe187128   ebx: f04a6ab8   ecx: f04a6a8c   edx: f04a6a8c
> esi: fe187128   edi: f4e355a0   ebp: f426c800   esp: f385df3c
> ds: 007b   es: 007b   ss: 0068
> Process drbd0_asender (pid: 2900, ti=f385d000 task=f4932000
> task.ti=f385d000)
> Stack: 000000e6 f8d1953b 00000000 f04a6a8c 000000e6 00000001 ee187b14 00000046
>        f49e7bc0 f04a6ab8 f04a6a8c f426c800 fe187128 f4e355a0 0000349f f8d24805
>        00000800 f426c800 f426c800 00000008 f426c9f4 f8d14d47 f385dfbc f8d15fbc
> Call Trace:

>  [<f8d1953b>] _req_may_be_done+0x4ea/0x710 [drbd]

and according to this stack trace,
it is the "list_del(&req->tl_requests)" in _req_is_done().

this list is protected by the req_lock spinlock.

"tl" is short for "transfer log", which is the main housekeeping list
structure we have for replication requests, so it is modified all the
time.  If we had a list corruption bug there, someone else should have
noticed.  I have no idea how a NULL pointer could get there.
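
to illustrate what such a list deletion does and why a damaged
neighbour pointer makes it fault, here is a minimal userspace sketch.
it is not the DRBD or kernel code: struct node and the node_* helpers
are hypothetical names made up for this example, and the locking that
req_lock provides in DRBD is left out entirely. the checked delete
refuses to touch the neighbours when the links are NULL or no longer
point back at the entry, instead of dereferencing them blindly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel-style circular list node. */
struct node {
    struct node *next;
    struct node *prev;
};

/* An empty list is a head that points at itself. */
static void node_init(struct node *n)
{
    n->next = n;
    n->prev = n;
}

/* Link n in just before head, i.e. at the tail of the list. */
static void node_add_tail(struct node *n, struct node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Unlink n, but only if its neighbours still agree that n sits between
 * them. Returns false (and changes nothing) on NULL or inconsistent
 * links, which is where an unchecked delete would follow a bad pointer.
 */
static bool node_del_checked(struct node *n)
{
    if (n->prev == NULL || n->next == NULL)
        return false;
    if (n->prev->next != n || n->next->prev != n)
        return false;

    n->prev->next = n->next;
    n->next->prev = n->prev;

    /* Clear the links so that a second delete is detected. */
    n->next = NULL;
    n->prev = NULL;
    return true;
}
```

in this sketch a double delete, or a delete of an entry whose links were
overwritten, shows up as a false return value rather than as an oops.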

try to reproduce with a different kernel
or with different hardware.

>  [<f8d24805>] tl_release+0x35/0x172 [drbd]
>  [<f8d14d47>] got_BarrierAck+0x10/0x6b [drbd]
>  [<f8d15fbc>] drbd_asender+0x3b1/0x4e7 [drbd]
>  [<f8d24a53>] drbd_thread_setup+0x0/0x14e [drbd]
>  [<f8d24adb>] drbd_thread_setup+0x88/0x14e [drbd]
>  [<f8d24a53>] drbd_thread_setup+0x0/0x14e [drbd]
>  [<c0405c3b>] kernel_thread_helper+0x7/0x10
>  =======================
> Code: 89 c3 eb eb 90 90 53 89 c3 8b 40 04 8b 00 39 d8 74 17 50 53 68
> 9b 9a 63 c

-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list   --   I'm subscribed