[DRBD-user] Can't unload ib_sdp module after "drbdadm down all" (fishy module refcount)

Lars Ellenberg lars.ellenberg at linbit.com
Wed Jan 4 18:08:41 CET 2012


On Wed, Jan 04, 2012 at 04:40:57PM +0100, Lars Ellenberg wrote:
> On Tue, Jan 03, 2012 at 08:16:15PM +0100, Florian Haas wrote:
> > Hi,
> > 
> > DRBD 8.3.12 on CentOS 6.2; SDP from kernel-ib-1.5.3, built locally
> 
> You too, of all people?
> 
> Something crops up, and since DRBD is used at the same time,
> it has to be DRBD's fault?
> 
> I mean, that's possible, of course. But ...

Hm, well, I love to correct my own arrogant statements  ;-)

> > # drbdadm down all; lsmod | grep ib_sdp
> > ib_sdp                130827  4294967294
> > 
> > 4 billion references on that module look excessive. :) I suppose the
> > refcount incorrectly goes negative.
> 
> Sure. That's a -2.
> 
> > This is inconvenient as you're now unable to unload ib_sdp. I presume
> > this is a bug;
> 
> /me too ;-)
> 
> Only at this point I doubt it is a DRBD bug.
>
> All module refcount stuff is implicit, so I would expect the module
> count on all other network related modules to go wrong as well.

Hm. We'll see about that.

> Besides, I think I complained about that to the OFED guys
> about two and a half years ago already,
> when I helped to fix their memleak and frame corruption.
> 
> Never pressed the issue, though,
> and can not remember any useful response.
> 
> Of course it _may_ be DRBD, or something that DRBD could work around,
> but I suspect it is something in the OFED stack.
> If they reason otherwise, I'll listen.
> 
> > if I can provide any traces or debug logs to narrow
> > down the issue I'll be happy to.
> 
> Let us know what you find out.

Based on some git blaming, I found

drbd: 53eb779 (July 2008)
kernel: ac5a488e (long ago), 1b08534e (Dec 2008)

The relevant part of the latter is:

commit 1b08534e562dae7b084326f8aa8cc12a4c1b6593
net: Fix module refcount leak in kernel_accept()
...

diff --git a/net/socket.c b/net/socket.c
index 92764d8..76ba80a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2307,6 +2307,7 @@ int kernel_accept(struct socket *sock, struct socket **newsock, int flags)
 	}
 
 	(*newsock)->ops = sock->ops;
+	__module_get((*newsock)->ops->owner);
 
 done:
 	return err;


So. We are doing it as the kernel was doing it back in July 2008,
only the kernel was doing it wrong, and got fixed in December :-/

You can check whether you see the same imbalance with ipv6 as well
(when it is built as a module).
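
When checking, it may help to read the "Used by" column of lsmod as a
signed number. A minimal sketch, assuming the count wraps around at
32 bits (as the 4294967294 above suggests); the helper name
signed_refcount is made up for illustration and is not part of any tool:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: take the "Used by" field as printed by lsmod
 * and reinterpret it as a signed 32-bit count.
 * Assumption: the printed value is a 32-bit counter that wrapped.
 */
static int32_t signed_refcount(const char *field)
{
	uint32_t raw = (uint32_t)strtoul(field, NULL, 10);

	/* Values above INT32_MAX are negative counts in two's complement. */
	if (raw > (uint32_t)INT32_MAX)
		return -(int32_t)(UINT32_MAX - raw) - 1;
	return (int32_t)raw;
}
```

Fed the string "4294967294", this should give -2, matching the "-2"
mentioned earlier in the thread; small positive counts pass through
unchanged.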

And you can try a patch:

diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c
index 0e55c45..7decee3 100644
--- a/drbd/drbd_receiver.c
+++ b/drbd/drbd_receiver.c
@@ -528,6 +528,7 @@ STATIC int drbd_accept(struct drbd_conf *mdev, const char **what,
 		goto out;
 	}
 	(*newsock)->ops  = sock->ops;
+	__module_get((*newsock)->ops->owner);
 
 out:
 	return err;
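
To illustrate why the missing __module_get() matters, here is a rough
userspace model of the accounting, not kernel code. It assumes that
releasing a socket always drops one reference on the module owning its
ops, while the unpatched accept path never takes one. All names
(fake_module, refcnt_after, ...) are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct module; only the counter matters. */
struct fake_module {
	int refcnt;
};

/* Models __module_get(): take one reference. */
static void fake_module_get(struct fake_module *m)
{
	m->refcnt++;
}

/* Models the module_put() done when a socket is released. */
static void fake_module_put(struct fake_module *m)
{
	m->refcnt--;
}

/* One accepted socket that is later released again. */
static void accept_then_release(struct fake_module *owner, bool with_fix)
{
	if (with_fix)
		fake_module_get(owner);	/* the line added by the patch */
	fake_module_put(owner);		/* release always drops a reference */
}

/* Refcount after n accepted sockets have been closed, starting at 0. */
static int refcnt_after(int n, bool with_fix)
{
	struct fake_module m = { .refcnt = 0 };

	for (int i = 0; i < n; i++)
		accept_then_release(&m, with_fix);
	return m.refcnt;
}
```

In this model, refcnt_after(2, false) ends up at -2 and
refcnt_after(2, true) at 0, so each accepted socket leaks one "put"
unless the accept path takes its own reference.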


Thanks,

	Lars


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


