[DRBD-user] Kernel Oops, RH9.0, 2.4.20-19.9

Sat Jun 19 11:56:16 CEST 2004

On Fri, Jun 18, 2004 at 01:23:48PM -0700, Tim Hasson wrote:
> 
> Quoting Lars Ellenberg <Lars.Ellenberg at linbit.com>,
> on Mon, 14 Jun 2004 09:40:47 +0200:
> 
> > / 2004-06-13 21:39:38 -0700
> > \ Tim Hasson:
> > > Here's the results after I upgraded to kernel 2.4.26 from kernel.org (also
> > > recompiled drbd 0.6.12)
> > > 
> > > 
> > > 
> >                                                                              
> >  |
> > .--- slightly edited syslog:
> > | drbd: ===> drbd start <===
> > | drbd: modprobe -s drbd minor_count=2
> > | kernel: drbd: initialised. Version: 0.6.12 (api:64/proto:62)
> > | kernel: drbd0: Creating state file
> > | kernel: "/var/lib/drbd/drbd0"
> > | kernel: klogd 1.4.1, ---------- state change ----------
> > | kernel: drbd1: Creating state file
> > | kernel: "/var/lib/drbd/drbd1"
> > | kernel: drbd0: Connection established. size=35277516 KB / blksize=4096 B
> > | kernel: drbd1: Connection established. size=35277516 KB / blksize=4096 B
> > | drbd: drbdsetup /dev/nb1 wait_connect -t 0
> > | drbd: 'drbd0' SyncingAll, waiting for this to finish
> > | drbd: 'drbd1' SyncingAll, waiting for this to finish
> > | drbd: drbdsetup /dev/nb0 wait_sync
> > | drbd: drbdsetup /dev/nb1 wait_sync
> > | 
> > | about one minute later:
> > | 
> > | kernel: KERNEL: assertion (tp->copied_seq == tp->rcv_nxt || \
> > | 		(flags&(MSG_PEEK|MSG_TRUNC))) failed at tcp.c(1603)
> > | last message repeated 11 times
> > | kernel: KERNEL: assertion (flags&MSG_PEEK) failed at tcp.c(1540)
> > | kernel: KERNEL: assertion (skb==NULL || before(tp->copied_seq, \
> > | 		TCP_SKB_CB(skb)->end_seq)) failed at tcp.c(1290)
> > | kernel: KERNEL: assertion (flags&MSG_PEEK) failed at tcp.c(1540)
> > | kernel: KERNEL: assertion (skb==NULL || before(tp->copied_seq, \
> > | 		TCP_SKB_CB(skb)->end_seq)) failed at tcp.c(1290)
> > | kernel: KERNEL: assertion (flags&MSG_PEEK) failed at tcp.c(1540)
> > | kernel: KERNEL: assertion (skb==NULL || before(tp->copied_seq, \
> > | 		TCP_SKB_CB(skb)->end_seq)) failed at tcp.c(1290)
> > | kernel: KERNEL: assertion (flags&MSG_PEEK) failed at tcp.c(1540)
> > | last message repeated 3 times
> > `---
> > 
> > > Any ideas?
> > 
> > your NIC is broken and cannot stand the load of a drbd full sync anymore.
> > I bet you can crash that box by just running a 
> >  netcat -l -p 7777 > /dev/null < /dev/null # on drbd2
> > and do a
> >  netcat drbd2 7777 < /dev/zero     # from drbd1 or any other box...
> > 
> > 
> > 	Lars Ellenberg
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
> > 
> 
> 
> Well nc didn't crash it.
> 
> Replaced the gigabit nic (corresponding to the ip that's set up in drbd.conf)
> with a 10/100 nic. Locked up at 2% of the sync after drbd start.
> 
> Tried disconnecting one of the two drives, same thing. Swapped it with the
> other drive, same thing.
> 
> 
> Here's the last output (after replacing the gigabit nic with a 10/100)
> 
> 
> Jun 14 15:51:59 drbd2 drbd: ===> drbd start <===
> Jun 14 15:51:59 drbd2 drbd: modprobe -s drbd minor_count=2
> Jun 14 15:51:59 drbd2 kernel: drbd: initialised. Version: 0.6.12 (api:64/proto:62)
> Jun 14 15:51:59 drbd2 drbd: drbdsetup /dev/nb0 disk /dev/sdb1 --do-panic --disk-size=35277516k
> Jun 14 15:52:00 drbd2 drbd: drbdsetup /dev/nb0 net 192.168.0.87:7788 192.168.0.86:7788 C --sync-min=10M --sync-max=25M --sync-nice=-10 --tl-size=5000 --timeout=60 --connect-int=10 --ping-int=10
> Jun 14 15:52:00 drbd2 drbd: drbdsetup /dev/nb1 disk /dev/sdc1 --do-panic --disk-size=35277516k
> Jun 14 15:52:00 drbd2 kernel: drbd1: Creating state file
> Jun 14 15:52:00 drbd2 kernel: "/var/lib/drbd/drbd1"
> Jun 14 15:52:00 drbd2 kernel: klogd 1.4.1, ---------- state change ----------
> Jun 14 15:52:00 drbd2 kernel: drbd0: Connection established. size=35277516 KB / blksize=4096 B
> Jun 14 15:52:00 drbd2 drbd: drbdsetup /dev/nb1 net 192.168.0.87:7789 192.168.0.86:7789 C --sync-min=10M --sync-max=25M --sync-nice=-10 --tl-size=5000 --timeout=60 --connect-int=10 --ping-int=10
> Jun 14 15:52:00 drbd2 drbd: drbdsetup /dev/nb0 wait_connect -t 0
> Jun 14 15:52:00 drbd2 drbd: drbdsetup /dev/nb1 wait_connect -t 0
> Jun 14 15:52:00 drbd2 kernel: drbd1: Connection established. size=35277516 KB / blksize=4096 B
> Jun 14 15:52:00 drbd2 drbd: 'drbd0' SyncingAll, waiting for this to finish
> Jun 14 15:52:00 drbd2 drbd: drbdsetup /dev/nb0 wait_sync
> Jun 14 15:52:00 drbd2 drbd: 'drbd1' SyncingAll, waiting for this to finish
> Jun 14 15:52:00 drbd2 drbd: drbdsetup /dev/nb1 wait_sync
> Jun 14 15:52:41 drbd2 kernel: Unable to handle kernel NULL pointer
> dereference<1>Unable to handle kernel NULL pointer dereference at virtual

I already told you that posting undecoded oopses is useless.
reposting it does not help either :)

the drbd processes crash somewhere deep in the network stack,
because the *network* stack tries to dereference a null pointer.
at least that's what happened in the first decoded oops you sent, and
my guess is that this is still the very same problem.

it is not that drbd passes wrong arguments to the network stack.
these are tcp-internal housekeeping things that go wrong.
I don't know what we can do about that. I don't even know how one would
go about debugging this, since atm you are the only one who is able to
reproduce the problem, and you told us that the same hardware was
working for three months without problems, and that you yourself suspect
a hardware failure.
without new information, or some step-by-step to reproduce this on _any_
setup, or any other indication that this is really DRBD's fault, I see no
way to help you.
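
as an aside, the netcat load test quoted above can be approximated without
netcat. the following is only a minimal sketch in Python, not something from
drbd itself: the function names (run_sink, send_zeros, loopback_demo) are
made up for illustration, and the loopback demo only checks that the script
works. to actually stress a NIC you would run the sink on one box and call
send_zeros() from another, with a much larger byte count.

```python
import socket
import threading


def run_sink(server_sock, result):
    """Accept one connection and discard all data (like `netcat -l > /dev/null`)."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:          # peer closed the connection
                break
            total += len(chunk)
    result["received"] = total


def send_zeros(host, port, total_bytes, chunk_size=65536):
    """Stream total_bytes of zeros to host:port (like `netcat host port < /dev/zero`)."""
    zeros = bytes(chunk_size)
    sent = 0
    with socket.create_connection((host, port)) as s:
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            s.sendall(zeros[:n])
            sent += n
    return sent


def loopback_demo(total_bytes=1 << 20):
    """Run sink and sender on 127.0.0.1; returns (bytes sent, bytes received)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    t = threading.Thread(target=run_sink, args=(server, result))
    t.start()
    sent = send_zeros("127.0.0.1", port, total_bytes)
    t.join()
    server.close()
    return sent, result["received"]


if __name__ == "__main__":
    print(loopback_demo())
```

if a plain TCP stream like this runs for a long time between the two boxes
without trouble, that does not prove the hardware is fine, but it is one more
data point that the crash needs something beyond raw network load.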

it may be hardware failure. it may be broken ram. it may be random
memory corruption. it may be overheating. it may be some broken
kernel - drbd interaction. it may even be a bug in drbd. but nothing in
your post is specific enough to tell.

sorry.

	Lars Ellenberg