[Drbd-dev] DRBD8:Panics in lc_find due to null lc after Scsi MEDIUM_ERROR I.O error.

Montrose, Ernest Ernest.Montrose at stratus.com
Wed Jan 31 14:58:41 CET 2007


This one is automatic.  During a read, an I/O error occurred.  The
specific error is ATA stat/err 0x51/40, translated into SCSI error
0x3/11/04, i.e. Medium error / Read error / auto-reallocation failed.
This error appears to be an ATA Uncorrectable ECC error... so even
though the read may have completed, the data is not validated.  This is
just for completeness, to help understand the situation.  The main
issue is that drbd ends up with a NULL lc.  Here is the stack trace.
Also see Simon Graham's very good analysis after it:
 
Jan 30 11:05:46 bo kernel: ------------[ cut here ]------------
Jan 30 11:05:46 bo kernel: kernel BUG at /sandbox/emontros/devel/trunk_drbd8/platform/drbd/src/drbd/lru_cache.c:120!
Jan 30 11:05:46 bo kernel: invalid opcode: 0000 [#1]
Jan 30 11:05:46 bo kernel: SMP
Jan 30 11:05:46 bo kernel: Modules linked in: drbd cn bridge ipv6
ipmi_devintf ipmi_si ipmi_msghandler binfmt_misc dm_mirror video thermal
processor fan container button battery ac hw_random i2c_i801 i2c_core
shpchp pci_hotplug e1000 piix ide_cd cdrom raid1 dm_mod ide_disk
ata_piix libata sd_mod scsi_mod
Jan 30 11:05:46 bo kernel: CPU:    0
Jan 30 11:05:46 bo kernel: EIP:    0061:[<ee3ad1c4>]    Tainted: GF VLI
Jan 30 11:05:46 bo kernel: EFLAGS: 00010046  (2.6.16.29-xen #1)
Jan 30 11:05:46 bo kernel: EIP is at lc_find+0x44/0x50 [drbd]
Jan 30 11:05:46 bo kernel: eax: 00000000  ebx: 00000000  ecx: ec8e13b0  edx: 00000058
Jan 30 11:05:46 bo kernel: esi: 00000058  edi: ec8e13b0  ebp: c59b9f08  esp: c59b9f00
Jan 30 11:05:46 bo kernel: ds: 007b  es: 007b  ss: 0069
Jan 30 11:05:46 bo kernel: Process drbd1_worker (pid: 6253, threadinfo=c59b8000 task=c586e570)
Jan 30 11:05:46 bo kernel: Stack: <0>00000058 ec8e1000 c59b9f44 ee3ac5bd c59b9f44 00000000 c586e570 c0137100
Jan 30 11:05:46 bo kernel:        c59b9f20 00000058 00000000 00000000 002c0000 00000000 eb121e74 ec8e1000
Jan 30 11:05:46 bo kernel:        ec8e1000 c59b9f74 ee39cd16 c59b9f5c c59b9f74 00000005 ed6e6820 ec8e102c
Jan 30 11:05:46 bo kernel: Call Trace:
Jan 30 11:05:46 bo kernel:  [<c0105431>] show_stack_log_lvl+0xa1/0xe0
Jan 30 11:05:46 bo kernel:  [<c0105621>] show_registers+0x181/0x200
Jan 30 11:05:46 bo kernel:  [<c0105840>] die+0x100/0x1a0
Jan 30 11:05:46 bo kernel:  [<c0105961>] do_trap+0x81/0xc0
Jan 30 11:05:46 bo kernel:  [<c0105c45>] do_invalid_op+0xa5/0xb0
Jan 30 11:05:46 bo kernel:  [<c0105097>] error_code+0x2b/0x30
Jan 30 11:05:46 bo kernel:  [<ee3ac5bd>] drbd_rs_complete_io+0x5d/0x130 [drbd]
Jan 30 11:05:46 bo kernel:  [<ee39cd16>] w_e_end_rsdata_req+0x26/0x390 [drbd]
Jan 30 11:05:46 bo kernel:  [<ee39dcae>] drbd_worker+0x2de/0x4b5 [drbd]
Jan 30 11:05:46 bo kernel:  [<ee3b010c>] drbd_thread_setup+0x8c/0x100 [drbd]
Jan 30 11:05:46 bo kernel:  [<c0102ec5>] kernel_thread_helper+0x5/0x10
Jan 30 11:05:46 bo kernel: Code: c3 ff ff ff 8b 44 83 4c eb 0d 8b 10 0f 18 02 90 39 70 14 74 08 89 d0 85 c0 75 ef 31 c0 5b 5e 5d c3 0f 0b 79 00 30 f2 3b ee eb d0 <0f> 0b 78 00 30 f2 3b ee eb bf 89 f6 55 31 d2 89 e5 53 39 00 74
Jan 30 11:05:46 bo kernel:  <0>Fatal exception: panic in 5 seconds

Here is Simon's original analysis, which may help track this down:

This looks like another instance of the same bug we fixed in the data
path - you can't look at mdev->resync or mdev->act_log without first
getting a local reference on the mdev.  The way this was fixed
previously was to make sure the reference is taken before calling
drbd_al_complete_io in all cases - I think we just need to do the same
with drbd_rs_complete_io as well.
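
The guard described above might look roughly like this (a pseudocode sketch against the drbd 8 internal API; the function names are taken from this mail and their exact signatures are not verified here):

```c
/* Pseudocode sketch, not a drop-in patch: only touch mdev->resync
 * while holding a local reference on the mdev. */
if (inc_local_if_state(mdev, Failed)) {
        drbd_rs_complete_io(mdev, sector);  /* safe: we hold a local ref */
        dec_local(mdev);
}
/* bookkeeping that does not touch the local disk stays outside the guard */
dec_rs_pending(mdev);
```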

There are three places where I think this is not done:

1.    got_NegRSDReply in drbd_receiver.c
2.    w_make_resync_request in drbd_worker.c
3.    w_e_end_rsdata_req - this is the one we actually crashed on here.

Taking each one in turn:

1. got_NegRSDReply -- possibly add inc_local_if_state()/dec_local()?
   Make sure you still call dec_rs_pending() though -- only
   drbd_rs_complete_io() and drbd_rs_failed_io() should be protected
   with the inc_local_if_state(Failed).

2. w_make_resync_request -- this one I think we need to defer to
   Linbit; this is a complex routine and it's possible it won't ever be
   called when the local disk has been detached (the detach should stop
   the resync!).  However, I am concerned there is a race condition
   where we could be in this routine when the disk gets detached and we
   have no local ref on the mdev.

3. w_e_end_rsdata_req -- this is run as a worker item at the end of
   processing a resync data request.  The bio completion routine is
   drbd_endio_read_sec, and it decrements the local count on the mdev
   before the work item runs.  I think the right fix here is to move
   the dec_local() from that routine into w_e_end_rsdata_req, after
   dec_unacked() is called (note that there are TWO exit paths from
   this routine that do this - fix both!)
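
If the analysis in point 3 is right, the fix could be sketched as the following diff-style pseudocode (hand-written from the description above, not an actual patch against the drbd source):

```diff
 /* drbd_endio_read_sec(): stop dropping the local ref in the bio
  * completion handler, where it races with the deferred work item */
-        dec_local(mdev);

 /* w_e_end_rsdata_req(): drop it only after dec_unacked(), and on
  * BOTH exit paths of the routine */
         dec_unacked(mdev);
+        dec_local(mdev);
```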

Note that we should report this to Linbit ASAP -- there may be some
other places where mdev->resync is accessed without the proper
protection...
 