[Drbd-dev] DRBD-8: recent regression causing corruption and crashes

Fri Aug 11 04:31:43 CEST 2006

It seems that there has been a recent regression (somewhere in the past
couple of weeks) that is causing me a couple of nasty and very
reproducible problems as follows:

1. I get errors during initial synchronization of a volume like this
that cause the resync to be aborted:

drbd15: tl_verify: failed to find req e51a4da0, sector 0 in list
drbd15: Got a corrupt block_id/sector pair(2).
drbd15: short read expecting header on sock: r=-512
drbd15: tl_clear()

	and

drbd15: ASSERT( drbd_req_get_sector(i) == sector ) in
/sandbox/sgraham/sn/trunk/platform/drbd/8.0/drbd/drbd_main.c:308
drbd15: tl_verify: found req eaffff28 but it has wrong sector (ac030
versus 0)

	(Note that this trace is the enhanced trace I posted a few hours
ago). When I turn on the packet tracing I get some very
	odd messages, including lots of WriteAck's with 0 for the sector
(but everything else correct - SYNCER_ID for the block id and an 
      incrementing sequence number 0 like this:

Aug 10 10:58:24 ellwood kernel: drbd15: <<< WriteAck (sector 0, size
8000, id ffffffffffffffff, seq 1671)
Aug 10 10:58:24 ellwood kernel: drbd15: <<< WriteAck (sector 0, size
8000, id ffffffffffffffff, seq 1672)
Aug 10 10:58:24 ellwood kernel: drbd15: <<< WriteAck (sector 0, size
8000, id ffffffffffffffff, seq 1673)

	but I don't see the Ack's with non-SYNCER_ID block id that cause
the previous trace (I'm guessing this is just
	printk not writing all trace messages due to the rate). At first
I thought this was just a bug in the printing
	code but I'm damned if I can spot it!

2. I get panics with the following signature:- these look like they are
happening when a local write
    on the primary (which this node is) completes.

drbd15: tl_verify: failed to find req eab1ce48, sector 0 in list
drbd15: Got a corrupt block_id/sector pair(2).
drbd15: drbd_pp_alloc interrupted!
drbd15: error receiving RSDataRequest, l: 24!
drbd15: tl_clear()
general protection fault: 6fa0 [#1]
Modules linked in: drbd ipmi_devintf ipmi_si ipmi_msghandler video
thermal processor fan button battery ac
CPU:    0
EIP:    0061:[<eafabf4f>]    Not tainted VLI
EFLAGS: 00010096   (2.6.16.13-xen0 #2) 
EIP is at 0xeafabf4f
eax: eafabf3f   ebx: 00000001   ecx: 00000001   edx: 00000003
esi: eafabf48   edi: bf48eafa   ebp: c0557cbc   esp: c0557c9c
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, threadinfo=c0556000 task=c04c6860)
Stack: <0>c0115a98 eafabf40 00000003 00000000 00000000 00000000 00000001
00000020 
       c0557ce0 c0115b03 eaf97634 00000003 00000001 00000000 00000000
00000001 
       eaf975e8 c0557cec c021c27f 00000000 c0557d20 c044ca27 eaf97624
eaf97624 
Call Trace:
 [<c010513a>] show_stack_log_lvl+0xaa/0xe0
 [<c010534e>] show_registers+0x18e/0x210
 [<c0105549>] die+0xd9/0x180
 [<c0105eb4>] do_general_protection+0xc4/0x100
 [<c0104d5f>] error_code+0x2b/0x30
 [<c0115b03>] __wake_up+0x33/0x60
 [<c021c27f>] __up+0x1f/0x30
 [<c044ca27>] __up_wakeup+0x7/0x10
 [<f127f107>] drbd_endio_write_pri+0xa7/0x230 [drbd]
 [<c015efaf>] bio_endio+0x4f/0x80
 [<c03cf569>] dec_pending+0x49/0x80
 [<c03cf64b>] clone_endio+0xab/0xd0
 [<c015efaf>] bio_endio+0x4f/0x80
 [<c020e260>] __end_that_request_first+0xc0/0x340
 [<c020e52f>] end_that_request_chunk+0x1f/0x30
 [<c0310b57>] scsi_end_request+0x37/0x110
 [<c0310ea4>] scsi_io_completion+0x104/0x530
 [<c037c4fa>] sd_rw_intr+0xca/0x290
 [<c030bdd5>] scsi_finish_command+0x85/0xa0
 [<c03118cb>] scsi_softirq_done+0xcb/0xf0
 [<c020e5cb>] blk_done_softirq+0x8b/0xb0
 [<c011e4c2>] __do_softirq+0x62/0xd0
 [<c011e578>] do_softirq+0x48/0x60
 [<c011e645>] irq_exit+0x35/0x40
 [<c01067a2>] do_IRQ+0x22/0x30
 [<c02dfb74>] evtchn_do_upcall+0x54/0xa0
 [<c0104d90>] hypervisor_callback+0x2c/0x34
 [<c0102b36>] cpu_idle+0x66/0x70
 [<c010202b>] _stext+0x2b/0x40
 [<c055886a>] start_kernel+0x1aa/0x1f0
 [<c010006f>] 0xc010006f
Code: 00 a0 fa ea 6c bf fa ea 25 e4 44 c0 34 76 f9 ea 00 a0 fa ea 30 20
1d eb 00 00 00 00 01 00 00 00 30 20 1d eb 48 bf fa ea 48 bf fa <ea> 34
76 f9 ea a0 6f ad ea e8 75 f9 ea 24 76 f9 ea a0 6f ad ea 
 <0>Kernel panic - not syncing: Fatal exception in interrupt

At the moment I'm somewhat stumped by this and would welcome any
thoughts on where to look...

Thanks
Simon

PS: Should have explained what we did before hand - the sequence is:

1. Both nodes issue drbdmeta commands to initialize the disk area - this
is done independently - that is, the systems
   are not connected at this point.

2. One node mounts the drbd disk and untars a blob of data, still not
connected to the other node

3, The systems are connected and rebooted and start resyncing
automatically, leading to the above problems.
   Note that we are also writing the volume on the SyncSource during the
resync.