<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.5730.11" name=GENERATOR></HEAD>
<BODY>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007>This one is
automatic. During a read, an IO error occurred. The specific error is
ATA stat/err 0x51/40, translated into SCSI error 0x3/11/04, i.e. Medium error/Read
error/auto-reallocation failed. This appears to be an ATA
Uncorrectable ECC error... so though the read may have
completed,</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007>the data is not
valid. That is just for completeness, to help in understanding the failure.
The main issue is that drbd ends up with a NULL lc. Here
is</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007>the stack
trace. Also see Simon Graham's very good analysis after it:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=973544613-31012007></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007><FONT
face="Times New Roman" size=3>Jan 30 11:05:46 bo kernel: ------------[ cut here
]------------<BR>Jan 30 11:05:46 bo kernel: kernel BUG at
/sandbox/emontros/devel/trunk_drbd8/platform/drbd/src/drbd/lru_cache.c:120!<BR>Jan
30 11:05:46 bo kernel: invalid opcode: 0000 [#1]<BR>Jan 30 11:05:46 bo kernel:
SMP<BR>Jan 30 11:05:46 bo kernel: Modules linked in: drbd cn bridge ipv6
ipmi_devintf ipmi_si ipmi_msghandler binfmt_misc dm_mirror video thermal
processor fan container button battery ac hw_random i2c_i801 i2c_core shpchp
pci_hotplug e1000 piix ide_cd cdrom raid1 dm_mod ide_disk ata_piix libata sd_mod
scsi_mod<BR>Jan 30 11:05:46 bo kernel: CPU: 0<BR>Jan 30 11:05:46 bo
kernel: EIP: 0061:[<ee3ad1c4>] Tainted: GF
VLI<BR>Jan 30 11:05:46 bo kernel: EFLAGS: 00010046 (2.6.16.29-xen
#1)<BR>Jan 30 11:05:46 bo kernel: EIP is at lc_find+0x44/0x50 [drbd]<BR>Jan 30
11:05:46 bo kernel: eax: 00000000 ebx: 00000000 ecx: ec8e13b0
edx: 00000058<BR>Jan 30 11:05:46 bo kernel: esi: 00000058 edi:
ec8e13b0 ebp: c59b9f08 esp: c59b9f00<BR>Jan 30 11:05:46 bo kernel:
ds: 007b es: 007b ss: 0069<BR>Jan 30 11:05:46 bo kernel: Process
drbd1_worker (pid: 6253, threadinfo=c59b8000 task=c586e570)<BR>Jan 30 11:05:46
bo kernel: Stack: <0>00000058 ec8e1000 c59b9f44 ee3ac5bd c59b9f44 00000000
c586e570 c0137100<BR>Jan 30 11:05:46 bo kernel:
c59b9f20 00000058 00000000 00000000 002c0000 00000000 eb121e74 ec8e1000<BR>Jan
30 11:05:46 bo kernel: ec8e1000 c59b9f74 ee39cd16
c59b9f5c c59b9f74 00000005 ed6e6820 ec8e102c<BR>Jan 30 11:05:46 bo kernel: Call
Trace:<BR>Jan 30 11:05:46 bo kernel: [<c0105431>]
show_stack_log_lvl+0xa1/0xe0<BR>Jan 30 11:05:46 bo kernel:
[<c0105621>] show_registers+0x181/0x200<BR>Jan 30 11:05:46 bo
kernel: [<c0105840>] die+0x100/0x1a0<BR>Jan 30 11:05:46 bo
kernel: [<c0105961>] do_trap+0x81/0xc0<BR>Jan 30 11:05:46 bo
kernel: [<c0105c45>] do_invalid_op+0xa5/0xb0<BR>Jan 30 11:05:46 bo
kernel: [<c0105097>] error_code+0x2b/0x30<BR>Jan 30 11:05:46 bo
kernel: [<ee3ac5bd>] drbd_rs_complete_io+0x5d/0x130 [drbd]<BR>Jan 30
11:05:46 bo kernel: [<ee39cd16>] w_e_end_rsdata_req+0x26/0x390
[drbd]<BR>Jan 30 11:05:46 bo kernel: [<ee39dcae>]
drbd_worker+0x2de/0x4b5 [drbd]<BR>Jan 30 11:05:46 bo kernel:
[<ee3b010c>] drbd_thread_setup+0x8c/0x100 [drbd]<BR>Jan 30 11:05:46 bo
kernel: [<c0102ec5>] kernel_thread_helper+0x5/0x10<BR>Jan 30
11:05:46 bo kernel: Code: c3 ff ff ff 8b 44 83 4c eb 0d 8b 10 0f 18 02 90 39 70
14 74 08 89 d0 85 c0 75 ef 31 c0 5b 5e 5d c3 0f 0b 79 00 30 f2 3b ee eb d0
<0f> 0b 78 00 30 f2 3b ee<BR>eb bf 89 f6 55 31 d2 89 e5 53 39 00 74<BR>Jan
30 11:05:46 bo kernel: <0>Fatal exception: panic in 5
seconds</FONT><BR></SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007>Here is Simon's
original analysis, which may help track this down:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007><FONT
face="Times New Roman" size=3>Looks like another instance of the same bug we
fixed in the data path -- you can't look at mdev->resync or mdev->act_log
without first getting a local reference on the mdev... the way they fixed it
previously was to make sure this is done before calling drbd_al_complete_io in
all cases -- we just need to do the same with drbd_rs_complete_io as well, I
think.<BR><BR>There are three places where I think this is not
done:<BR><BR>1. got_NegRSDReply in
drbd_receiver.c<BR>2. w_make_resync_request in
drbd_worker.c<BR>3. w_e_end_rsdata_req - this is the one
we actually crashed on here.<BR><BR>Taking each one in turn:<BR>1.
got_NegRSDReply -- possibly add inc_local_if_state()/dec_local()? Make sure you
still call<BR> dec_rs_pending() though -- only drbd_rs_complete_io() and
drbd_rs_failed_io() should<BR> be protected with the
inc_local_if_state(Failed).<BR>2. w_make_resync_request -- this one I think we
need to defer to Linbit; this is<BR> a complex routine and it's possible
it won't ever be called when the local disk<BR> has been detached (the
detach should stop the resync!) - However, I am concerned<BR> there is a
race condition where we could be in this routine when the disk gets<BR>
detached and we have no local ref on the mdev.<BR>3. w_e_end_rsdata_req -- this
is run as a worker item at the end of processing<BR> a resync data request
-- the bio completion routine is drbd_endio_read_sec<BR> and this is
decrementing the local count on the mdev before the work item<BR> runs --
I think the right fix here is to move the dec_local() from this<BR>
routine into w_e_end_rsdata_req after dec_unacked() is called (note
that<BR> there are TWO exit paths from this routine that do this - fix
both!)<BR><BR>Note that we should report this ASAP to Linbit -- there may be
some other places where mdev->resync is accessed without the proper
protection... </FONT></SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007><FONT
face="Times New Roman" size=3></FONT>&nbsp;</SPAN></FONT></DIV></BODY></HTML>