[Drbd-dev] DRBD8: panic accessing NULL lru cache
Montrose, Ernest
Ernest.Montrose at stratus.com
Wed May 9 23:40:10 CEST 2007
We are seeing a panic in drbd_al_to_on_disk_bm().
Below is the stack and a possible cause:
May 3 05:17:59 choip kernel: EIP is at drbd_al_to_on_disk_bm+0x18/0x470 [drbd]
May 3 05:17:59 choip kernel: eax: 00000000 ebx: ecbd013c ecx: 00000000 edx: 00000000
May 3 05:17:59 choip kernel: esi: ecbd013c edi: 00000001 ebp: eb2b3ebc esp: eb2b3e34
May 3 05:17:59 choip heartbeat: [5898]: info: standby: acquire [all] resources from chois.sn.stratus.com
May 3 05:17:59 choip kernel: ds: 007b es: 007b ss: 0069
May 3 05:17:59 choip heartbeat: [13993]: info: acquire all HA resources (standby).
May 3 05:17:59 choip kernel: Process drbd15_receiver (pid: 5758, threadinfo=eb2b2000 task=c591c030)
May 3 05:17:59 choip kernel: Stack: <0>00000000 e7d8fe98 eb2b3eaa cabe263a eb2b3e78 ee238515 00001000 000000d0
May 3 05:17:59 choip kernel: 0000002c 0000009f 0000003c 00000004 000000d0 eb2b3e80 00000002 eb2b3e94
May 3 05:17:59 choip ResourceManager[14004]: info: Acquiring resource group: choip.sn.stratus.com drbddisk::shared.fs Filesystem::/dev/drbd15::/shared 134.111.32.220 httpd smd
May 3 05:17:59 choip kernel: eb2b3e80 eb2b3ebc ee42533c 00000004 00000001 0000009f 00000000 00000016
May 3 05:17:59 choip kernel: Call Trace:
May 3 05:17:59 choip kernel: [<c0105a01>] show_stack_log_lvl+0xa1/0xe0
May 3 05:17:59 choip ResourceManager[14004]: info: Running /etc/ha.d/resource.d/drbddisk shared.fs start
May 3 05:17:59 choip kernel: [<c0105bf1>] show_registers+0x181/0x200
May 3 05:17:59 choip kernel: [<c0105e10>] die+0x100/0x1b0
May 3 05:17:59 choip kernel: [<c01168f6>] do_page_fault+0x3c6/0x8c1
May 3 05:17:59 choip kernel: [<c010565f>] error_code+0x2b/0x30
May 3 05:17:59 choip kernel: [<ee41ad8e>] after_state_ch+0x77e/0xa70 [drbd]
May 3 05:17:59 choip kernel: [<ee40e1b1>] receive_state+0x281/0x3c0 [drbd]
May 3 05:17:59 choip kernel: [<ee40e8a2>] drbdd+0x42/0x170 [drbd]
May 3 05:17:59 choip kernel: [<ee40fc05>] drbdd_init+0x1c5/0x210 [drbd]
May 3 05:17:59 choip kernel: [<ee41b10c>] drbd_thread_setup+0x8c/0x100 [drbd]
May 3 05:17:59 choip kernel: [<c0103485>] kernel_thread_helper+0x5/0x10
May 3 05:17:59 choip kernel: Code: ff ff ff 8b 52 0c eb 94 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 57 56 53 83 ec 7c c7 45 90 00 10 00 00 89 c3 8b 80 c0 03 00 00 <f0> 0f ba 68 28 01 19 d2 31 c0 85 d2 0f 94 c0 85 c0 75 76 fc b9
May 3 05:17:59 choip kernel: <0>Fatal exception: panic in 5 seconds
======================
OK... wait_event() is a macro and lc_try_lock() is inline,
so the test_and_set_bit() on lc->flags below is likely where we died.

static inline int lc_try_lock(struct lru_cache *lc)
{
	return !test_and_set_bit(__LC_DIRTY, &lc->flags);  /* <===== I think we are here! */
}
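
Since wait_event() is a macro and lc_try_lock() is inline, the compiled
function body starts with exactly that bit operation. Below is a tiny
stand-alone userspace sketch of the failure mode; the struct layout, the
0x28 offset of ->flags and the __LC_DIRTY bit number are assumptions
inferred from the disassembly further down, not copied from the DRBD
headers, and __sync_fetch_and_or() stands in for the kernel's
test_and_set_bit():

/* Stand-alone sketch only -- layout and bit number are assumed. */
#include <stdio.h>
#include <stddef.h>

#define __LC_DIRTY 1	/* assumed bit number; matches the "btsl $0x1" below */

struct lru_cache_sketch {
	char pad[0x28];			/* assumed members living before flags */
	unsigned long flags;		/* ends up at offset 0x28, like in the oops */
};

/* userspace stand-in for !test_and_set_bit(__LC_DIRTY, &lc->flags) */
static inline int lc_try_lock_sketch(struct lru_cache_sketch *lc)
{
	unsigned long old = __sync_fetch_and_or(&lc->flags, 1UL << __LC_DIRTY);
	return !(old & (1UL << __LC_DIRTY));
}

int main(void)
{
	/* what mdev->act_log looks like after lc_free() + "mdev->act_log = NULL" */
	struct lru_cache_sketch *lc = NULL;

	printf("&lc->flags with lc == NULL is address 0x%zx\n",
	       offsetof(struct lru_cache_sketch, flags));

	if (0)				/* flip to 1 to reproduce the fault */
		lc_try_lock_sketch(lc);
	return 0;
}

With lc == NULL, &lc->flags is just the offset of flags (0x28 in this
sketch), which lines up with eax == 00000000 in the register dump and the
faulting "lock btsl $0x1,0x28(%eax)".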
Dump of assembler code for function drbd_al_to_on_disk_bm:
0x00013520 <drbd_al_to_on_disk_bm+0>: push %ebp
0x00013521 <drbd_al_to_on_disk_bm+1>: mov %esp,%ebp
0x00013523 <drbd_al_to_on_disk_bm+3>: push %edi
0x00013524 <drbd_al_to_on_disk_bm+4>: push %esi
0x00013525 <drbd_al_to_on_disk_bm+5>: push %ebx
0x00013526 <drbd_al_to_on_disk_bm+6>: sub $0x7c,%esp
0x00013529 <drbd_al_to_on_disk_bm+9>: movl $0x1000,0xffffff90(%ebp)
0x00013530 <drbd_al_to_on_disk_bm+16>: mov %eax,%ebx
0x00013532 <drbd_al_to_on_disk_bm+18>: mov 0x3c0(%eax),%eax  <===== loads mdev->act_log (NULL, per the theory below) into %eax
0x00013538 <drbd_al_to_on_disk_bm+24>: lock btsl $0x1,0x28(%eax)  <===== dead here! EIP +0x18 is +24 decimal; NULL deref on &lc->flags
0x0001353e <drbd_al_to_on_disk_bm+30>: sbb %edx,%edx
0x00013540 <drbd_al_to_on_disk_bm+32>: xor %eax,%eax
0x00013542 <drbd_al_to_on_disk_bm+34>: test %edx,%edx
0x00013544 <drbd_al_to_on_disk_bm+36>: sete %al
0x00013547 <drbd_al_to_on_disk_bm+39>: test %eax,%eax
0x00013549 <drbd_al_to_on_disk_bm+41>: jne 0x135c1 <drbd_al_to_on_disk_bm+161>
0x0001354b <drbd_al_to_on_disk_bm+43>: cld
Here is a theory, since I cannot reproduce this at will. It seems to me
that on the panicked node our disk had a fault inserted, so we went
Diskless. At that point we called after_state_ch() and did this:
if ( os.disk > Diskless && ns.disk == Diskless ) {
	/* since inc_local() only works as long as disk>=Inconsistent,
	   and it is Diskless here, local_cnt can only go down, it can
	   not increase... It will reach zero */
	wait_event(mdev->misc_wait, !atomic_read(&mdev->local_cnt));
	drbd_free_bc(mdev->bc); mdev->bc = NULL;
	lc_free(mdev->resync); mdev->resync = NULL;
	lc_free(mdev->act_log); mdev->act_log = NULL;  // We free things here!
}
So we freed the LRU caches, and mdev->act_log is now NULL.
But later, the peer got a fault inserted and went Diskless, so the peer
was set to Secondary. The panicked node received that state, called
after_state_ch(), and did this:
if( ns.pdsk < Inconsistent ) {
	/* Diskless Peer becomes primary */
	if (os.peer == Secondary && ns.peer == Primary ) {
		drbd_uuid_new_current(mdev);
	}
	/* Diskless Peer becomes secondary */
	if (os.peer == Primary && ns.peer == Secondary ) {
		drbd_al_to_on_disk_bm(mdev);
	}
}
But the act_log lru_cache had already been freed and set to NULL by the
time we called drbd_al_to_on_disk_bm(). If this is correct, I am still
not sure how best to fix it.
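
If that theory holds, one possible (untested) band-aid would be to only
call drbd_al_to_on_disk_bm() while we can still pin the local disk, along
the lines of the sketch below. This assumes inc_local() has a dec_local()
counterpart that drops local_cnt again (inc_local() is only mentioned in
the comment quoted above); a plain NULL check on mdev->act_log would at
least avoid the oops, though it still looks racy against the Diskless
transition.

/* Sketch only, not tested -- guards the peer-state branch in
 * after_state_ch(). Assumes inc_local()/dec_local() pin/unpin local_cnt
 * as described in the comment quoted above; if inc_local() fails we are
 * (going) Diskless and act_log may already be gone. */
if( ns.pdsk < Inconsistent ) {
	/* Diskless Peer becomes primary */
	if (os.peer == Secondary && ns.peer == Primary ) {
		drbd_uuid_new_current(mdev);
	}
	/* Diskless Peer becomes secondary */
	if (os.peer == Primary && ns.peer == Secondary ) {
		if (inc_local(mdev)) {		/* only if we still have a local disk */
			drbd_al_to_on_disk_bm(mdev);
			dec_local(mdev);
		}
	}
}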
Thanks
EM--