Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
cc'ed philipp and lmb, so they won't miss this mail, burried in some
"uninteressting" thread.
/ 2004-07-22 20:51:20 +0000
\ Florin Cazacu:
> Lars Ellenberg wrote:
>
> >so please revert that change for now,
> >and disable all use of drbd_send_page,
> >like below.
> >
> I disabled drbd_send_page, and it looks like it is working ok. I ran a
> bonnie benchmark and it looks like it is holding ok.
ok.
now, to help find the actual problem, you could revert that again,
but this time recompile and install a new kernel with
"kernel-hacking" ->
[*] Kernel debugging
[*] Debug memory allocations
[*] Page alloc debugging
or even enable xfs debugging...
then recompile drbd, of course.
and then trigger it again; maybe the logs show something more
interesting then...
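fwiw, if you prefer to set these in .config directly: on a 2.6 kernel
the menu entries above should map to roughly the following symbols
(names from memory, please double-check against your tree):

  # "Kernel debugging"
  CONFIG_DEBUG_KERNEL=y
  # "Debug memory allocations"
  CONFIG_DEBUG_SLAB=y
  # "Page alloc debugging"
  CONFIG_DEBUG_PAGEALLOC=y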
for the record, I think the problem is this:
facts:
xfs makes heavy use of slabs (kmem_zone_alloc, which maps to
kmem_cache_alloc). all of those pages have the PG_slab flag set.
eventually it submits them for io.
the page now reaches drbd, which via tcp_sendpage puts a reference to
it into the tcp send buffer. the tcp stack first get_page()s it, of
course. when the socket buffer is cleaned up after the tcp ack is
received, or the socket is shut down, or whatever: it put_page()s it
again.
now, the stack traces show that at this point the page_count()
reaches zero, so the page actually gets freed right there.
since it has PG_slab set => BOOM.
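to illustrate, here is a rough sketch of the refcounting described
above (not the actual tcp code; the function names are made up for
the example):

  #include <linux/mm.h>   /* get_page(), put_page(), page_count() */

  /* what effectively happens once drbd hands the page to tcp_sendpage(): */
  static void tcp_grabs_the_page(struct page *page)
  {
          get_page(page); /* tcp takes its own reference; the page now
                           * sits in the socket send buffer */
  }

  /* ... and later, when the data is acked or the socket is torn down: */
  static void tcp_releases_the_page(struct page *page)
  {
          put_page(page); /* if this drops page_count() to zero while
                           * PG_slab is still set => BOOM */
  }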
analysis:
either: the page_count() _IS_ already zero when the page is submitted
to drbd. in that case the tcp stack held the only reference to it,
and put_page() would try to free a slab page.
this seems very unlikely, and we could easily put an assert early in
the drbd code to prove this wrong.
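such an assert could look roughly like this ("page" being the page
handed to _drbd_send_page; the attached patch below adds essentially
this check):

  /* paranoia: the page must still be referenced when it reaches drbd */
  if (page_count(page) < 1) {
          printk(KERN_ERR "drbd: asked to send an already freed page!\n");
          dump_stack();
  }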
or: xfs for some reason kmem_zone_free's (kmem_cache_free) the
submitted pages _before_ they are sent (so before io on that page has
completed; no bio_endio called yet!), which means that xfs "frees" a
page to which the tcp stack still holds a reference.
this seems to be the likely code path.
now: either no one except xfs may hold a reference to their pages,
in which case xfs should prominently state this somewhere.
or xfs just does something it must not do: freeing pages that still
have outstanding references.
does anyone want to ask the xfs guys about this?
solution approaches:
a. we could disable zero copy networking completely (tcp_sendpage).
b. we could make it configurable.
c. we could simply fall back to tcp_sendmsg for slab pages.
patch for c. is attached. if it works for Florin (please confirm),
then it will go into svn soonish.
any comments?
Lars Ellenberg
well.
drbd-"user" seems to be a very mixed list of newbies, beginners,
users, power users, and developers...
but as long as nobody complains ...
:)
-------------- next part --------------
Index: drbd_main.c
===================================================================
--- drbd_main.c (revision 1448)
+++ drbd_main.c (working copy)
@@ -883,12 +883,35 @@
that we do not reuse our own buffer pages (EEs) to early, therefore
we have the net_ee list.
*/
+int _drbd_no_send_page(drbd_dev *mdev, struct page *page,
+ int offset, size_t size)
+{
+ int ret;
+ ret = drbd_send(mdev, mdev->data.socket, kmap(page) + offset, size, 0);
+ kunmap(page);
+ return ret;
+}
+
int _drbd_send_page(drbd_dev *mdev, struct page *page,
int offset, size_t size)
{
int sent,ok;
int len = size;
+ /* PARANOIA. if this ever triggers,
+ * something in the layers above us is really kaputt */
+ ERR_IF (page_count(page) < 1) {
+ ERR("someone wants to send a free page!\n");
+ dump_stack();
+ return _drbd_no_send_page(mdev, page, offset, size);
+ }
+
+ if (PageSlab(page)) {
+ /* probably xfs. fall back to sendmsg instead of sendpage.
+ */
+ return _drbd_no_send_page(mdev, page, offset, size);
+ }
+
spin_lock(&mdev->send_task_lock);
mdev->send_task=current;
spin_unlock(&mdev->send_task_lock);