[Drbd-dev] DRBD8: panic accessing NULL lru cache

Montrose, Ernest Ernest.Montrose at stratus.com
Wed May 9 23:40:10 CEST 2007


We are seeing a panic in drbd_al_to_on_disk_bm()
 
Below is the stack and a possible cause:
May  3 05:17:59 choip kernel: EIP is at drbd_al_to_on_disk_bm+0x18/0x470 [drbd]
May  3 05:17:59 choip kernel: eax: 00000000  ebx: ecbd013c  ecx: 00000000  edx: 00000000
May  3 05:17:59 choip kernel: esi: ecbd013c  edi: 00000001  ebp: eb2b3ebc  esp: eb2b3e34
May  3 05:17:59 choip heartbeat: [5898]: info: standby: acquire [all] resources from chois.sn.stratus.com
May  3 05:17:59 choip kernel: ds: 007b  es: 007b  ss: 0069
May  3 05:17:59 choip heartbeat: [13993]: info: acquire all HA resources (standby).
May  3 05:17:59 choip kernel: Process drbd15_receiver (pid: 5758, threadinfo=eb2b2000 task=c591c030)
May  3 05:17:59 choip kernel: Stack: <0>00000000 e7d8fe98 eb2b3eaa cabe263a eb2b3e78 ee238515 00001000 000000d0
May  3 05:17:59 choip kernel:        0000002c 0000009f 0000003c 00000004 000000d0 eb2b3e80 00000002 eb2b3e94
May  3 05:17:59 choip ResourceManager[14004]: info: Acquiring resource group: choip.sn.stratus.com drbddisk::shared.fs Filesystem::/dev/drbd15::/shared 134.111.32.220 httpd smd
May  3 05:17:59 choip kernel:        eb2b3e80 eb2b3ebc ee42533c 00000004 00000001 0000009f 00000000 00000016
May  3 05:17:59 choip kernel: Call Trace:
May  3 05:17:59 choip kernel:  [<c0105a01>] show_stack_log_lvl+0xa1/0xe0
May  3 05:17:59 choip ResourceManager[14004]: info: Running /etc/ha.d/resource.d/drbddisk shared.fs start
May  3 05:17:59 choip kernel:  [<c0105bf1>] show_registers+0x181/0x200
May  3 05:17:59 choip kernel:  [<c0105e10>] die+0x100/0x1b0
May  3 05:17:59 choip kernel:  [<c01168f6>] do_page_fault+0x3c6/0x8c1
May  3 05:17:59 choip kernel:  [<c010565f>] error_code+0x2b/0x30
May  3 05:17:59 choip kernel:  [<ee41ad8e>] after_state_ch+0x77e/0xa70 [drbd]
May  3 05:17:59 choip kernel:  [<ee40e1b1>] receive_state+0x281/0x3c0 [drbd]
May  3 05:17:59 choip kernel:  [<ee40e8a2>] drbdd+0x42/0x170 [drbd]
May  3 05:17:59 choip kernel:  [<ee40fc05>] drbdd_init+0x1c5/0x210 [drbd]
May  3 05:17:59 choip kernel:  [<ee41b10c>] drbd_thread_setup+0x8c/0x100 [drbd]
May  3 05:17:59 choip kernel:  [<c0103485>] kernel_thread_helper+0x5/0x10
May  3 05:17:59 choip kernel: Code: ff ff ff 8b 52 0c eb 94 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 57 56 53 83 ec 7c c7 45 90 00 10 00 00 89 c3 8b 80 c0 03 00 00 <f0> 0f ba 68 28 01 19 d2 31 c0 85 d2 0f 94 c0 85 c0 75 76 fc b9
May  3 05:17:59 choip kernel:  <0>Fatal exception: panic in 5 seconds 
 
======================
OK... wait_event() is a macro and lc_try_lock() is inline, so the
test_and_set_bit() on lc->flags below is likely where we died:

static inline int lc_try_lock(struct lru_cache *lc)
{
        return !test_and_set_bit(__LC_DIRTY, &lc->flags); /* <===== I think we are here!!!!! */
}
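
For reference, my understanding is that drbd_al_to_on_disk_bm() enters the
LRU-cache code through something like the following (a sketch from my reading
of the source; the exact wait-queue name may differ):

        /* presumed first step of drbd_al_to_on_disk_bm():
         * grab the activity log's lru_cache before writing it out */
        wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));

If mdev->act_log is NULL here, lc_try_lock() does a locked test_and_set_bit()
on &lc->flags with lc == NULL, which matches the "lock btsl $0x1,0x28(%eax)"
with %eax == 0 in the disassembly below.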

Dump of assembler code for function drbd_al_to_on_disk_bm:
0x00013520 <drbd_al_to_on_disk_bm+0>:  push  %ebp
0x00013521 <drbd_al_to_on_disk_bm+1>:  mov    %esp,%ebp
0x00013523 <drbd_al_to_on_disk_bm+3>:  push  %edi
0x00013524 <drbd_al_to_on_disk_bm+4>:  push  %esi
0x00013525 <drbd_al_to_on_disk_bm+5>:  push  %ebx
0x00013526 <drbd_al_to_on_disk_bm+6>:  sub    $0x7c,%esp
0x00013529 <drbd_al_to_on_disk_bm+9>:  movl  $0x1000,0xffffff90(%ebp)
0x00013530 <drbd_al_to_on_disk_bm+16>:  mov    %eax,%ebx
0x00013532 <drbd_al_to_on_disk_bm+18>:  mov    0x3c0(%eax),%eax
<===== loads the lru_cache pointer; %eax becomes 0 here
0x00013538 <drbd_al_to_on_disk_bm+24>:  lock btsl $0x1,0x28(%eax)
<===== dead here!!! (+24 decimal is the +0x18 in the EIP)
0x0001353e <drbd_al_to_on_disk_bm+30>:  sbb    %edx,%edx
0x00013540 <drbd_al_to_on_disk_bm+32>:  xor    %eax,%eax
0x00013542 <drbd_al_to_on_disk_bm+34>:  test  %edx,%edx
0x00013544 <drbd_al_to_on_disk_bm+36>:  sete  %al
0x00013547 <drbd_al_to_on_disk_bm+39>:  test  %eax,%eax
0x00013549 <drbd_al_to_on_disk_bm+41>:  jne    0x135c1
<drbd_al_to_on_disk_bm+161>
0x0001354b <drbd_al_to_on_disk_bm+43>:  cld
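
Reading the oops against that disassembly (my interpretation; the offsets come
from my own disassembly of the module, so they could differ slightly from the
running one):

  eax = 00000000        the lru_cache pointer loaded at +18 (presumably
                        mdev->act_log), i.e. NULL
  ebx = esi = ecbd013c  mdev, copied into %ebx at +16
  EIP  = +0x18 (hex)    that is +24 decimal, the "lock btsl $0x1,0x28(%eax)",
                        i.e. test_and_set_bit(__LC_DIRTY, &lc->flags) with
                        lc == NULL; the <f0> marked in the Code: line is that
                        instruction's lock prefix.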

Here is a theory, since I cannot reproduce this at will.  It seems to me that
on the panicked node our disk had a fault inserted, so we went Diskless.  At
that point we called after_state_ch() and did this:

 if ( os.disk > Diskless && ns.disk == Diskless ) {
                /* since inc_local() only works as long as disk>=Inconsistent,
                   and it is Diskless here, local_cnt can only go down, it can
                   not increase... It will reach zero */
                wait_event(mdev->misc_wait, !atomic_read(&mdev->local_cnt));

                drbd_free_bc(mdev->bc); mdev->bc = NULL;
                lc_free(mdev->resync);  mdev->resync = NULL;
                lc_free(mdev->act_log); mdev->act_log = NULL; /* We free things here!!!! */
        }
So we freed the lru cache and set mdev->act_log to NULL.

But later the peer got a fault inserted and entered Diskless, so we set the
peer to Secondary.  The panicked node received that state, called
after_state_ch(), and did this:
    if( ns.pdsk < Inconsistent ) {
                /* Diskless Peer becomes primary */
                if (os.peer == Secondary && ns.peer == Primary ) {
                        drbd_uuid_new_current(mdev);
                }
                /* Diskless Peer becomes secondary */
                if (os.peer == Primary && ns.peer == Secondary ) {
                        drbd_al_to_on_disk_bm(mdev); 
                }
        }
But the act_log lru_cache had already been freed by the time we called
drbd_al_to_on_disk_bm().  If this is correct, I am still not sure how best
to fix it.
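
One possible direction, strictly an untested sketch: guard the call by pinning
the local disk, assuming inc_local() (the helper mentioned in the comment of
the Diskless branch above) returns non-zero on success and that dec_local()
is its release counterpart:

                /* Untested sketch: only flush the AL to the on-disk bitmap if
                 * we can still get a reference on the local disk, so that
                 * act_log cannot be freed underneath us. */
                if (os.peer == Primary && ns.peer == Secondary ) {
                        if (inc_local(mdev)) {
                                drbd_al_to_on_disk_bm(mdev);
                                dec_local(mdev);
                        }
                }

A plain NULL check on mdev->act_log before the call would also avoid this
particular oops, but it would not close the race with the Diskless transition,
so taking a reference looks safer to me.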
 
Thanks
EM--
 