[DRBD-user] slab.c

Wed May 12 16:32:44 CEST 2004

/ 2004-05-12 16:09:15 +0200
\ Philipp Reisner:
> > The other problem is that the module seems to crash the machine when I try
> > to reload it after it has been unloaded. After having unloaded the module I
> > get:
> >
> > drbd0: short read expecting header on sock: r=-512
> > drbd0: worker terminated
> > drbd0: asender terminated
> > drbd0: Connection lost.
> > drbd0: receiver terminated
> > drbd0: worker terminated
> > drbd0: ASSERT( mdev->ee_vacant==0 )
> > in /root/src/drbd-0.7_pre7/drbd/drbd_main.c:1417
> > slab error in kmem_cache_destroy(): cache `drbd_ee_cache': Can't free all
> > objects
> > Call Trace:
> >  [<c0147595>] kmem_cache_destroy+0xd5/0x120
> >  [<e1108ab8>] drbd_destroy_mempools+0x58/0x90 [drbd]
> >  [<e1115f15>] drbd_cleanup+0x215/0x4b5 [drbd]
> >  [<c01383db>] sys_delete_module+0x15b/0x1b0
> >  [<c015264e>] do_munmap+0x16e/0x1f0
> >  [<c01062db>] syscall_call+0x7/0xb
> >
> > drbd: kmem_cache_destroy(drbd_ee_cache) FAILED
> >
> 
> This is interesting! Only the assertion in drbd_main.c:1417 fires,
> but not the ERR() statements above. Where is this ee ?
> 
> Could you please retry with this patch applied ?
> 
> RCS file: /var/lib/cvs/drbd/drbd/drbd/drbd_main.c,v
> retrieving revision 1.73.2.171
> diff -u -p -u -r1.73.2.171 drbd_main.c
> --- drbd/drbd_main.c    12 May 2004 10:00:47 -0000      1.73.2.171
> +++ drbd/drbd_main.c    12 May 2004 14:07:57 -0000
> @@ -1417,6 +1417,7 @@ ONLY_IN_26(
>                         if(rr) ERR("%d: %d EEs in read list found!\n",i,rr);
> 
>                         D_ASSERT(mdev->ee_vacant==0);
> +                       D_ASSERT(list_empty(&mdev->data.work.q));
> 
>                         if (mdev->md_io_page)
>                                 __free_page(mdev->md_io_page);
> 
> 
> If this new assertion triggers, then at least we know where this
> missing ee is.

Yes, I guess it is there.
That's why I put this "goto again;" in the worker cleanup path, which is
already in CVS now. But still, there seems to be something "in flight
somewhere...", otherwise the ASSERT in the worker thread would have
triggered... unless the root of the problem was that unbalanced
dec_ap_pending on a failed barrier send...

So try the patch above, and/or retry with CVS...
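
To illustrate why a single entry left on a work queue is enough to make
the cache teardown fail, here is a minimal userspace sketch. It is not
DRBD or kernel code: every toy_* name is hypothetical, the "cache" only
counts outstanding objects, and the queue is a plain singly linked list.
It assumes the failure mode discussed above, namely that an object is
still queued when the cache is destroyed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a slab cache: it only counts live objects. */
struct toy_cache {
    const char *name;
    int outstanding;            /* allocated but not yet freed */
};

/* Hypothetical stand-in for an "ee" that can sit on a work queue. */
struct toy_ee {
    struct toy_ee *next;
};

struct toy_queue {
    struct toy_ee *head;
};

struct toy_ee *toy_alloc(struct toy_cache *c)
{
    struct toy_ee *e = calloc(1, sizeof(*e));
    if (e)
        c->outstanding++;
    return e;
}

void toy_free(struct toy_cache *c, struct toy_ee *e)
{
    assert(c->outstanding > 0);
    free(e);
    c->outstanding--;
}

void toy_queue_push(struct toy_queue *q, struct toy_ee *e)
{
    e->next = q->head;
    q->head = e;
}

/* Release everything still queued; returns the number of entries freed. */
int toy_queue_drain(struct toy_queue *q, struct toy_cache *c)
{
    int n = 0;
    while (q->head) {
        struct toy_ee *e = q->head;
        q->head = e->next;
        toy_free(c, e);
        n++;
    }
    return n;
}

/* Returns 0 on success, -1 if objects are still allocated. */
int toy_cache_destroy(struct toy_cache *c)
{
    if (c->outstanding != 0) {
        fprintf(stderr, "toy cache `%s': %d object(s) still allocated\n",
                c->name, c->outstanding);
        return -1;
    }
    return 0;
}

/* Queue one entry, optionally drain the queue, then try to destroy. */
int toy_demo(int drain_first)
{
    struct toy_cache cache = { "toy_ee_cache", 0 };
    struct toy_queue q = { NULL };
    struct toy_ee *e = toy_alloc(&cache);
    int rc;

    if (!e)
        return -2;
    toy_queue_push(&q, e);
    if (drain_first)
        toy_queue_drain(&q, &cache);
    rc = toy_cache_destroy(&cache);
    toy_queue_drain(&q, &cache);    /* do not leak in the demo itself */
    return rc;
}
```

In this sketch toy_demo(0) reports a failed destroy because the queued
entry was never released, while toy_demo(1) drains the queue first and
the destroy succeeds. An emptiness check on the queue just before
teardown, like the added D_ASSERT, would point at the same place.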

	Lars