[DRBD-user] 0.7-pre7-2004/06/01 oops

Sat Jun 5 10:11:14 CEST 2004

/ 2004-06-05 01:34:54 -0400
\ T. Howell-Cintron:
> 
> I'm running 0.7pre7 2004/06/01.  The (initial) primary is an FC2 2.6.5
> and runs fine, save for eventually noting that the connection to the
> secondary was lost.  The secondary is a FC1 2.4.22 machine and loads the
> module fine, but shortly after starting to sync I get several oops (see
> attached).

we have not tested drbd 0.7 on a 2.4 kernel in a while;
we only made sure it still compiles.
it is likely that there is some strange
typo/misconception in the 2.4 compat wrappers.

in case I have time, or it occurs to me that it's obvious,
I'll dig into this, but don't expect a patch too soon.

meanwhile, please use 2.6 kernels only.

> #
> # initial module load, sync, etc.
> #
> 
> drbd: initialised. Version: 0.7-pre7 cvs $Date: 2004/06/01 07:33:19 $ (api:73/proto:72)
> drbd0: size = 59713605 KB
> drbd0: 56879176 KB marked out-of-sync by on disk bit-map.
> drbd0: No usable activity log found.
> drbd0: Connection established.
> drbd0: Peer switched to Primary state
> drbd0: Resync started as target (need to sync 56879173 KB).
> Unable to handle kernel NULL pointer dereference at virtual address 00000004
>  printing eip:
> d08f85f7
> *pde = 0b9a1067
> *pte = 00000000
> Oops: 0002
> drbd parport_pc lp parport 3c59x keybdev mousedev hid input usb-uhci usbcore ext3 jbd aic7xxx sd_mod scsi_mod  
> CPU:    0
> EIP:    0060:[<d08f85f7>]    Not tainted
> EFLAGS: 00010086
> 
> EIP is at finish_wait [drbd] 0x27 (2.4.22-1.2188.nptl)

aha.
it just occurred to me that it's obvious.
first, your kernel already contains a backport of those functions,
second, our finish_wait backport was wrong.

ok, this should be fixed by the patch below, which should be in cvs soonish...

diff -u -p -r1.97.2.165 drbd_receiver.c
=======================

--- drbd_receiver.c	1 Jun 2004 14:29:07 -0000	1.97.2.165
+++ drbd_receiver.c	5 Jun 2004 07:54:52 -0000
@@ -248,11 +248,11 @@ STATIC void prepare_to_wait(wait_queue_h
 {
 	unsigned long flags;
 
+	__set_current_state(state);
 	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list))
 		__add_wait_queue(q, wait);
-	set_current_state(state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 
@@ -261,10 +261,11 @@ STATIC void finish_wait(wait_queue_head_
 	unsigned long flags;
 
 	__set_current_state(TASK_RUNNING);
-
-	spin_lock_irqsave(&q->lock, flags);
-	list_del_init(&wait->task_list);
-	spin_unlock_irqrestore(&q->lock, flags);
+	if (!list_empty(&wait->task_list)) {
+		spin_lock_irqsave(&q->lock, flags);
+		list_del_init(&wait->task_list);
+		spin_unlock_irqrestore(&q->lock, flags);
+	}
 }
 
 #define DEFINE_WAIT(name)	DECLARE_WAITQUEUE(name,current)
=======================
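
to illustrate the guarded list removal in the finish_wait hunk above, here
is a minimal userspace sketch. it is not the drbd or kernel code: the list
helpers are simplified re-implementations, finish_wait_sketch and
lock_taken are hypothetical names, and the counter merely stands in for
spin_lock_irqsave/spin_unlock_irqrestore. it also assumes the entry was set
up with INIT_LIST_HEAD, so that an unqueued entry points at itself.

```c
#include <stddef.h>

/* simplified userspace stand-in for the kernel's struct list_head */
struct list_head {
	struct list_head *next, *prev;
};

/* an initialised, unqueued entry points at itself */
void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* "empty" means the entry is linked only to itself */
int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* link entry n right after head */
void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* unlink entry e and leave it pointing at itself again */
void list_del_init(struct list_head *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
	INIT_LIST_HEAD(e);
}

/* hypothetical counter: how often the (imaginary) queue lock was taken */
int lock_taken;

/* same shape as the patched finish_wait: only lock and unlink
 * when the entry is actually queued somewhere */
void finish_wait_sketch(struct list_head *entry)
{
	if (!list_empty(entry)) {
		lock_taken++;		/* stands in for spin_lock_irqsave() */
		list_del_init(entry);
		/* spin_unlock_irqrestore() would go here */
	}
}
```

calling finish_wait_sketch on an entry that was never queued does nothing
and takes no lock; calling it on a queued entry unlinks it once and leaves
it safe to pass in again.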


> #
> # after running 'drbdadm disconnect..'
> #
>  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000004
>  printing eip:
> c012731f
> *pde = 00000000
> Oops: 0000
> drbd parport_pc lp parport 3c59x keybdev mousedev hid input usb-uhci usbcore ext3 jbd aic7xxx sd_mod scsi_mod
> CPU:    0
> EIP:    0060:[<c012731f>]    Not tainted
> EFLAGS: 00010006
>  
> EIP is at force_sig_info [kernel] 0x2f (2.4.22-1.2188.nptl)
> eax: 00000014   ebx: cfa47b9c   ecx: cb4c4000   edx: 00000000
> esi: cfa47800   edi: 00000001   ebp: cfa47940   esp: cb0abed0
> ds: 0068   es: 0068   ss: 0068
> Process drbdsetup (pid: 2711, stackpage=cb0ab000)
> Stack: 00002b00 00000001 00000001 00000282 cfa47b9c cfa47800 00000001 cfa47940
>        c0127bef 00000001 00000001 cb4c4000 d08f1d0d 00000001 cb4c4000 d08f19db
>        00000000 c01452f4 00000296 00000000 cfa47800 d08f0a27 cfa47b9c 00000000
> Call Trace:   [<c0127bef>] force_sig [kernel] 0x1f (0xcb0abef0)

hm. this is not that easy.
we don't register signal handlers, but use force_sig nevertheless.
RH 2.4 force_sig_info does not seem to like this, and chokes somewhere in
kernel/signal.c ...
sorry, I don't see the exact cause right now, and have no quick fix either.


	Lars Ellenberg