[DRBD-user] Pacemaker - DRBD fails on node every couple hours

Lars Ellenberg lars.ellenberg at linbit.com
Thu Mar 1 16:04:56 CET 2012


On Thu, Mar 01, 2012 at 01:22:22PM +0100, Christoph Roethlisberger wrote:
> I ran the loop a couple of times and it always "failed" rather soon:
> 
> -------------------------------------------------------------
> # i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo "failed after $i calls"
> <1>error creating netlink socket
> Could not connect to 'drbd' generic netlink family
> failed after 28492 calls
> 
> # i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo "failed after $i calls"
> <1>error creating netlink socket
> Could not connect to 'drbd' generic netlink family
> failed after 31887 calls
> 
> # i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo "failed after $i calls"
> <1>error creating netlink socket
> Could not connect to 'drbd' generic netlink family
> failed after 10861 calls
> -------------------------------------------------------------
> 
> 
> Attached you should find the output from a run with strace.

bind(3, {sa_family=AF_NETLINK, pid=1432, groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
write(2, "<1>error creating netlink socket\n", 33) = 33

That "should not happen": the netlink port id (nl_pid) is set to the
process id, which is supposed to make it unique. Yet bind() reports
that port id 1432 is already in use.
Oh well.
We can ask the kernel to assign that "port id" for us.

Please check whether this preliminary patch improves the situation for you.
Let me know if you need any help with applying it or rebuilding.

diff --git a/user/libgenl.c b/user/libgenl.c
index 0a6ea2e..713d653 100644
--- a/user/libgenl.c
+++ b/user/libgenl.c
@@ -26,15 +26,17 @@ int genl_join_mc_group(struct genl_sock *s, const char *name) {
 static struct genl_sock *genl_connect(__u32 nl_groups)
 {
 	struct genl_sock *s = calloc(1, sizeof(*s));
+	socklen_t sock_len;
 	int err;
+	int pid = getpid();
 	int bsz = 2 << 10;
 
 	if (!s)
 		return NULL;
 
-	/* the netlink port id - use the process id, it is unique,
-	 * and "everyone else does it". */
-	s->s_local.nl_pid = getpid();
+	/* autobind; kernel is responsible to give us something unique
+	 * in bind() below. */
+	s->s_local.nl_pid = 0;
 	s->s_local.nl_family = AF_NETLINK;
 	/*
 	 * If we want to receive multicast traffic on this socket, kernels
@@ -50,9 +52,15 @@ static struct genl_sock *genl_connect(__u32 nl_groups)
 	if (s->s_fd == -1)
 		goto fail;
 
+	sock_len = sizeof(s->s_local);
 	err = setsockopt(s->s_fd, SOL_SOCKET, SO_SNDBUF, &bsz, sizeof(bsz)) ||
 	      setsockopt(s->s_fd, SOL_SOCKET, SO_RCVBUF, &bsz, sizeof(bsz)) ||
-	      bind(s->s_fd, (struct sockaddr*) &s->s_local, sizeof(s->s_local));
+	      bind(s->s_fd, (struct sockaddr*) &s->s_local, sizeof(s->s_local)) ||
+	      getsockname(s->s_fd, (struct sockaddr*) &s->s_local, &sock_len);
+
+	dbg(pid != s->s_local.nl_pid ? 1 : 3,
+		"bound socket to nl_pid:%u, my pid:%u, len:%d, sizeof:%u\n",
+		s->s_local.nl_pid, pid, (int)sock_len, (unsigned)sizeof(s->s_local));
 
 	if (err)
 		goto fail;



-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


