<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.5730.11" name=GENERATOR></HEAD>
<BODY>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007>This one is
automatic. During a read, an IO error occurred. The specific error is
ATA stat/err 0x51/40, translated into SCSI error 0x3/11/04, i.e. Medium error/Read
error/auto-reallocation failed. This appears to be an ATA
Uncorrectable ECC error... so though the read may have
completed,</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007>the data is not
valid. That is just for completeness, to help in understanding the failure.
The main issue is that drbd ends up with a NULL lc. Here
is</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007>the stack
trace. Also see Simon Graham's very good analysis after it:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=973544613-31012007></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007><FONT
face="Times New Roman" size=3>Jan 30 11:05:46 bo kernel: ------------[ cut here
]------------<BR>Jan 30 11:05:46 bo kernel: kernel BUG at
/sandbox/emontros/devel/trunk_drbd8/platform/drbd/src/drbd/lru_cache.c:120!<BR>Jan
30 11:05:46 bo kernel: invalid opcode: 0000 [#1]<BR>Jan 30 11:05:46 bo kernel:
SMP<BR>Jan 30 11:05:46 bo kernel: Modules linked in: drbd cn bridge ipv6
ipmi_devintf ipmi_si ipmi_msghandler binfmt_misc dm_mirror video thermal
processor fan container button battery ac hw_random i2c_i801 i2c_core shpchp
pci_hotplug e1000 piix ide_cd cdrom raid1 dm_mod ide_disk ata_piix libata sd_mod
scsi_mod<BR>Jan 30 11:05:46 bo kernel: CPU: 0<BR>Jan 30 11:05:46 bo
kernel: EIP: 0061:[<ee3ad1c4>] Tainted: GF
VLI<BR>Jan 30 11:05:46 bo kernel: EFLAGS: 00010046 (2.6.16.29-xen
#1)<BR>Jan 30 11:05:46 bo kernel: EIP is at lc_find+0x44/0x50 [drbd]<BR>Jan 30
11:05:46 bo kernel: eax: 00000000 ebx: 00000000 ecx: ec8e13b0
edx: 00000058<BR>Jan 30 11:05:46 bo kernel: esi: 00000058 edi:
ec8e13b0 ebp: c59b9f08 esp: c59b9f00<BR>Jan 30 11:05:46 bo kernel:
ds: 007b es: 007b ss: 0069<BR>Jan 30 11:05:46 bo kernel: Process
drbd1_worker (pid: 6253, threadinfo=c59b8000 task=c586e570)<BR>Jan 30 11:05:46
bo kernel: Stack: <0>00000058 ec8e1000 c59b9f44 ee3ac5bd c59b9f44 00000000
c586e570 c0137100<BR>Jan 30 11:05:46 bo kernel:
c59b9f20 00000058 00000000 00000000 002c0000 00000000 eb121e74 ec8e1000<BR>Jan
30 11:05:46 bo kernel: ec8e1000 c59b9f74 ee39cd16
c59b9f5c c59b9f74 00000005 ed6e6820 ec8e102c<BR>Jan 30 11:05:46 bo kernel: Call
Trace:<BR>Jan 30 11:05:46 bo kernel: [<c0105431>]
show_stack_log_lvl+0xa1/0xe0<BR>Jan 30 11:05:46 bo kernel:
[<c0105621>] show_registers+0x181/0x200<BR>Jan 30 11:05:46 bo
kernel: [<c0105840>] die+0x100/0x1a0<BR>Jan 30 11:05:46 bo
kernel: [<c0105961>] do_trap+0x81/0xc0<BR>Jan 30 11:05:46 bo
kernel: [<c0105c45>] do_invalid_op+0xa5/0xb0<BR>Jan 30 11:05:46 bo
kernel: [<c0105097>] error_code+0x2b/0x30<BR>Jan 30 11:05:46 bo
kernel: [<ee3ac5bd>] drbd_rs_complete_io+0x5d/0x130 [drbd]<BR>Jan 30
11:05:46 bo kernel: [<ee39cd16>] w_e_end_rsdata_req+0x26/0x390
[drbd]<BR>Jan 30 11:05:46 bo kernel: [<ee39dcae>]
drbd_worker+0x2de/0x4b5 [drbd]<BR>Jan 30 11:05:46 bo kernel:
[<ee3b010c>] drbd_thread_setup+0x8c/0x100 [drbd]<BR>Jan 30 11:05:46 bo
kernel: [<c0102ec5>] kernel_thread_helper+0x5/0x10<BR>Jan 30
11:05:46 bo kernel: Code: c3 ff ff ff 8b 44 83 4c eb 0d 8b 10 0f 18 02 90 39 70
14 74 08 89 d0 85 c0 75 ef 31 c0 5b 5e 5d c3 0f 0b 79 00 30 f2 3b ee eb d0
<0f> 0b 78 00 30 f2 3b ee<BR>eb bf 89 f6 55 31 d2 89 e5 53 39 00 74<BR>Jan
30 11:05:46 bo kernel: <0>Fatal exception: panic in 5
seconds</FONT><BR></SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007>Here is Simon's
original analysis, which may help track this down:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007><FONT
face="Times New Roman" size=3>Looks like another instance of the same bug we
fixed in the data path -- you can't look at mdev->resync or mdev->act_log
without first getting a local reference on the mdev... the way they fixed it
previously was to make sure this is done before calling drbd_al_complete_io in
all cases -- we just need to do the same with drbd_rs_complete_io as well, I
think.<BR><BR>There are three places where I think this is not
done:<BR><BR>1. got_NegRSDReply in
drbd_receiver.c<BR>2. w_make_resync_request in
drbd_worker.c<BR>3. w_e_end_rsdata_req - this is the one
we actually crashed on here.<BR><BR>Taking each one in turn:<BR>1.
got_NegRSDReply -- possibly add inc_local_if_state()/dec_local()? Make sure you
still call<BR> dec_rs_pending() though -- only drbd_rs_complete_io() and
drbd_rs_failed_io() should<BR> be protected with the
inc_local_if_state(Failed).<BR>2. w_make_resync_request -- this one I think we
need to defer to Linbit; this is<BR> a complex routine and it's possible
it won't ever be called when the local disk<BR> has been detached (the
detach should stop the resync!) - However, I am concerned<BR> there is a
race condition where we could be in this routine when the disk gets<BR>
detached and we have no local ref on the mdev.<BR>3. w_e_end_rsdata_req -- this
is run as a worker item at the end of processing<BR> a resync data request
-- the bio completion routine is drbd_endio_read_sec<BR> and this is
decrementing the local count on the mdev before the work item<BR> runs --
I think the right fix here is to move the dec_local() from this<BR>
routine into w_e_end_rsdata_req after dec_unacked() is called (note
that<BR> there are TWO exit paths from this routine that do this - fix
both!)<BR><BR>Note that we should report this ASAP to Linbit -- there may be
some other places where mdev->resync is accessed without the proper
protection... </FONT></SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=973544613-31012007><FONT
face="Times New Roman" size=3></FONT>&nbsp;</SPAN></FONT></DIV></BODY></HTML>