Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
cc'ed philipp and lmb, so they won't miss this mail, burried in some
"uninteressting" thread.
/ 2004-07-22 20:51:20 +0000
\ Florin Cazacu:
> Lars Ellenberg wrote:
>
> >so please revert that change for now,
> >and disable all use of drbd_send_page,
> >like below.
> >
> I disabled drbd_send_page, and it looks like it is working ok. I ran a
> bonnie benchmark and it looks like it is holding ok.
ok.
now, to help find the actual problem, you could revert that again,
but this time recompile and install a new kernel with
"kernel-hacking" ->
[*] Kernel debugging
[*] Debug memory allocations
[*] Page alloc debugging
or even enable xfs debugging...
then recompile drbd, of course.
and then trigger it again; maybe the logs show something more
interesting then...
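fwiw, if you prefer to set these in .config directly: on a 2.6 kernel
the menu entries above should map to roughly the following symbols
(names from memory, please double-check against your tree):

  # "Kernel debugging"
  CONFIG_DEBUG_KERNEL=y
  # "Debug memory allocations"
  CONFIG_DEBUG_SLAB=y
  # "Page alloc debugging"
  CONFIG_DEBUG_PAGEALLOC=y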
for the record, I think the problem is this:
facts:
xfs makes heavy use of slabs (kmem_zone_alloc, which maps to
kmem_cache_alloc). all of those pages have the PG_slab flag set.
eventually it submits them for io.
the page now reaches drbd, which via tcp_sendpage puts a reference to
it into the tcp send buffer. the tcp stack first get_page()s it, of
course. when the socket buffer is cleaned up after the tcp ack is
received, or the socket is shut down, or whatever: it put_page()s it
again.
now, the stack traces show that at this point the page_count()
reaches zero, so the page actually gets freed right there.
since it has PG_slab set => BOOM.
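to illustrate, here is a rough sketch of the refcounting described
above (not the actual tcp code; the function names are made up for
the example):

  #include <linux/mm.h>   /* get_page(), put_page(), page_count() */

  /* what effectively happens once drbd hands the page to tcp_sendpage(): */
  static void tcp_grabs_the_page(struct page *page)
  {
          get_page(page); /* tcp takes its own reference; the page now
                           * sits in the socket send buffer */
  }

  /* ... and later, when the data is acked or the socket is torn down: */
  static void tcp_releases_the_page(struct page *page)
  {
          put_page(page); /* if this drops page_count() to zero while
                           * PG_slab is still set => BOOM */
  }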
analysis:
either: the page_count() _IS_ already zero when the page is submitted
to drbd. in that case the tcp stack held the only reference to it,
and put_page() would try to free a slab page.
this seems very unlikely, and we could easily put an assert early in
the drbd code to prove this wrong.
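such an assert could look roughly like this ("page" being the page
handed to _drbd_send_page; the attached patch below adds essentially
this check):

  /* paranoia: the page must still be referenced when it reaches drbd */
  if (page_count(page) < 1) {
          printk(KERN_ERR "drbd: asked to send an already freed page!\n");
          dump_stack();
  }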
or: xfs for some reason kmem_zone_free's (kmem_cache_free) the
submitted pages _before_ they are sent (so before io on that page has
completed; no bio_endio called yet!), which means that xfs "frees" a
page to which the tcp stack still holds a reference.
this seems to be the likely code path.
now: either no one except xfs may hold a reference to their pages,
in which case xfs should prominently state this somewhere.
or xfs just does something it must not do: freeing pages that still
have outstanding references.
does anyone want to ask the xfs guys about this?
solution approaches:
a. we could disable zero copy networking completely (tcp_sendpage).
b. we could make it configurable.
c. we could simply fall back to tcp_sendmsg for slab pages.
patch for c. is attached. if it works for Florin (please confirm),
then it will go into svn soonish.
any comments?
Lars Ellenberg
well.
drbd-"user" seems to be a very mixed list of newbies, beginners,
users, power users, and developers...
but as long as nobody complains ...
:)
-------------- next part --------------
Index: drbd_main.c
===================================================================
--- drbd_main.c (revision 1448)
+++ drbd_main.c (working copy)
@@ -883,12 +883,35 @@
that we do not reuse our own buffer pages (EEs) to early, therefore
we have the net_ee list.
*/
+int _drbd_no_send_page(drbd_dev *mdev, struct page *page,
+ int offset, size_t size)
+{
+ int ret;
+ ret = drbd_send(mdev, mdev->data.socket, kmap(page) + offset, size, 0);
+ kunmap(page);
+ return ret;
+}
+
int _drbd_send_page(drbd_dev *mdev, struct page *page,
int offset, size_t size)
{
int sent,ok;
int len = size;
+ /* PARANOIA. if this ever triggers,
+ * something in the layers above us is really kaputt */
+ ERR_IF (page_count(page) < 1) {
+ ERR("someone wants to send a free page!\n");
+ dump_stack();
+ return _drbd_no_send_page(mdev, page, offset, size);
+ }
+
+ if (PageSlab(page)) {
+ /* probably xfs. fall back to sendmsg instead of sendpage.
+ */
+ return _drbd_no_send_page(mdev, page, offset, size);
+ }
+
spin_lock(&mdev->send_task_lock);
mdev->send_task=current;
spin_unlock(&mdev->send_task_lock);