[Drbd-dev] DRBD-8 - system hangs when NegDReply received

Wed Sep 6 00:02:07 CEST 2006

I think I've found the cause of the panic... got_NegDReply includes this
code:

	spin_lock(&mdev->pr_lock);
	list_del(&req->w.list);
	spin_unlock(&mdev->pr_lock);

I think it should actually be removing the request from the collision
list like receive_DataReply does:

	spin_lock(&mdev->pr_lock);
	hlist_del(&req->colision);
	spin_unlock(&mdev->pr_lock);

(the request shouldn't even be on a work queue at this time!) This would
explain why I get a crash later on walking the collision list...
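
A side note on the oops quoted below: the faulting address 00100104 and eax: 00100100 look consistent with the kernel's list poisoning. In kernels of this era list_del() sets entry->next to LIST_POISON1 (0x00100100), so a second list_del() on the same entry would store to that address plus the offset of prev. The following is a minimal userspace sketch of that arithmetic, not DRBD code; the helpers second_del_store_addr() and demo_double_del_addr() are hypothetical names made up for illustration, and the list functions are simplified copies of the kernel ones.

```c
/* Userspace sketch of 2.6-era list_del() poisoning; NOT DRBD code. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Poison values as defined in include/linux/list.h of that era. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	head->next->prev = entry;
	entry->next = head->next;
	entry->prev = head;
	head->next = entry;
}

/* Unlink entry and poison its pointers, as the kernel version does. */
static void list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

/*
 * Hypothetical helper: address that the first store of another
 * list_del(entry) would touch (entry->next->prev), computed without
 * dereferencing anything.
 */
static uintptr_t second_del_store_addr(const struct list_head *entry)
{
	return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}

/* Hypothetical demo: delete once, report where a second delete would write. */
static uintptr_t demo_double_del_addr(void)
{
	struct list_head head, req;

	init_list_head(&head);
	list_add(&req, &head);
	list_del(&req);
	/* The list itself is intact and empty after one delete. */
	assert(head.next == &head && head.prev == &head);
	return second_del_store_addr(&req);
}
```

On 32-bit x86, where prev sits at offset 4, demo_double_del_addr() yields 0x00100104, which matches the paging request address in the oops. That only suggests a list_del() on an entry that had already been deleted somewhere; it does not by itself say which code path did the first delete.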

As soon as I figure out why the request hangs in the first place, I'll
post a patch (but ideas are gladly received...) - adding the dec_xxx
calls didn't fix the problem, so I'm confused.

/simgr

> -----Original Message-----
> From: Graham, Simon
> Sent: Tuesday, September 05, 2006 4:47 PM
> To: Graham, Simon; drbd-dev at linbit.com
> Subject: RE: [Drbd-dev] DRBD-8 - system hangs when NegDReply received
> 
> In addition, a while after the system hung, it panicked with an invalid
> address in drbd_fail_pending_reads called from w_disconnect - panic msg
> is attached (I assume this happened because things were stuck, so
> PingAcks were not being sent and the partner system disconnected).
> 
> This looks to me like the temporary list of requests to complete, which
> is set up in this routine, is corrupted somehow.
> 
> Any suggestions?
> Simon
> 
> drbd0: State change from bad state. Error would be: 'Refusing to be
> Primary without at least one UpToDate disk'
> drbd0:  old = { cs:BrokenPipe st:Primary/Unknown ds:Diskless/DUnknown
> r--- }
> drbd0:  new = { cs:Unconnected st:Primary/Unknown ds:Diskless/DUnknown
> r--- }
>  [<c01050a1>] show_trace+0x21/0x30
>  [<c01051de>] dump_stack+0x1e/0x20
>  [<f1285e38>] _drbd_set_state+0xa08/0xa20 [drbd]
>  [<f127e393>] drbd_disconnect+0x223/0x310 [drbd]
>  [<f127ed18>] drbdd_init+0x78/0x120 [drbd]
>  [<f128681b>] drbd_thread_setup+0x6b/0xc0 [drbd]
>  [<c0102d9d>] kernel_thread_helper+0x5/0x18
> drbd0: No access to good data anymore.
> Unable to handle kernel paging request at virtual address 00100104
>  printing eip:
> f1277408
> *pde = ma 00000000 pa fffff000
> Oops: 0002 [#1]
> Modules linked in: drbd ipmi_devintf ipmi_si ipmi_msghandler video
> thermal processor fan button battery ac
> CPU:    0
> EIP:    0061:[<f1277408>]    Not tainted VLI
> EFLAGS: 00010297   (2.6.16.13-xen0 #1)
> EIP is at drbd_fail_pending_reads+0x78/0x240 [drbd]
> eax: 00100100   ebx: ec5eff60   ecx: ec883f48   edx: ec883f48
> esi: ef85fc00   edi: 00000002   ebp: ec883f5c   esp: ec883f34
> ds: 007b   es: 007b   ss: 0069
> Process drbd0_worker (pid: 4118, threadinfo=ec882000 task=efcdeab0)
> Stack: <0>e9dfae10 ee9f9520 fffffffb eca596a4 ef85fc00 ec5eff60
> e9dfae10 ef85fc00
>        ee9f9740 ef85fc38 ec883f7c f127760b ef85fc00 ef85fc38 ec883f7c
> c045bd8a
>        ee9f9740 ef85fc00 ec883fc0 f1278176 ef85fc00 ee9f9740 00000001
> 00000000
> Call Trace:
>  [<c010515a>] show_stack_log_lvl+0xaa/0xe0
>  [<c010536e>] show_registers+0x18e/0x210
>  [<c0105569>] die+0xd9/0x180
>  [<c0112ccc>] do_page_fault+0x3cc/0x68e
>  [<c0104d7f>] error_code+0x2b/0x30
>  [<f127760b>] w_disconnect+0x3b/0x2d0 [drbd]
>  [<f1278176>] drbd_worker+0x156/0x487 [drbd]
>  [<f128681b>] drbd_thread_setup+0x6b/0xc0 [drbd]
>  [<c0102d9d>] kernel_thread_helper+0x5/0x18
> Code: d2 75 e0 8b be 38 03 00 00 43 83 fb 0e 7e c4 31 c0 b9 0f 00 00 00
> f3 ab 8b 5d ec 8d 45 ec 39 c3 0f 84 ad 00 00 00 8b 53 04 8b 03 <89> 50
> 04 89 02 8b 53 18 b8 fb ff ff ff c7 03 00 01 10 00 c7 43
>  <0>Fatal exception: panic in 5 seconds