Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Mar 01, 2012 at 11:27:01AM +0100, Lars Ellenberg wrote:
> On Mon, Feb 27, 2012 at 05:15:29PM +0100, Christoph Roethlisberger wrote:
> > We use a simple 2-node active-passive cluster with DRBD and NFS services.
> >
> > Right now the cluster monitor detects a drbd failure every couple
> > of hours (~2-40) and fails over.
>
>
> Oh... I may have missed this context, and focused too much on the error
> log below.
>
> So you *do* have a working DRBD,
> and only the monitor operation fails "occasionally" (much too often,
> still), with the below error log.
>
> Did I understand correctly this time?
>
> > syslog shows the following lines just before pacemaker initiates the
> > failover:
> >
> > --------------------------------------
> > Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> > (p_drbd_r0:0:monitor:stderr) <1>error creating netlink socket
That error message above is triggered only if
- calloc(1, sizeof(struct genl_sock)) fails
  (very unlikely; that's only a few bytes, and if you were that hard
  out of memory, you would know...)
- s->s_fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC);
  fails, or
- any of
  err = setsockopt(s->s_fd, SOL_SOCKET, SO_SNDBUF, &bsz, sizeof(bsz)) ||
        setsockopt(s->s_fd, SOL_SOCKET, SO_RCVBUF, &bsz, sizeof(bsz)) ||
        bind(s->s_fd, (struct sockaddr*) &s->s_local, sizeof(s->s_local));
  fails.
None of which is likely to fail "only occasionally"; one quick check
is sketched below.
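If it is the socket() call that fails intermittently, one plausible
(though unconfirmed) cause would be file descriptor exhaustion in the
process running the monitor. Assuming that process is lrmd (the pidof
lookup is just a sketch, adjust to your setup), a quick check:

# file descriptors currently open in lrmd
ls /proc/$(pidof lrmd)/fd | wc -l
# versus the per-process limit
grep 'Max open files' /proc/$(pidof lrmd)/limits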
You could run a tight loop:
i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done;
echo "failed after $i calls"
If that fails after some time, you could repeat it under strace:
i=0; while strace -x -s 1024 -o /tmp/whatever.strace.out drbdsetup 0 dstate >/dev/null ; do let i++; done;
echo "failed after $i calls"
Now you should have an strace of the failed run in that file,
which we could analyse...
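To spot the failing call in that trace yourself: strace marks failed
syscalls with a return value of -1 followed by the errno name, so
grepping the output file from above should point right at the culprit:

grep -n ' = -1 E' /tmp/whatever.strace.out | tail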
> > Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> > (p_drbd_r0:0:monitor:stderr) Could not connect to 'drbd' generic
> > netlink family
> > Feb 24 20:55:54 drbdnode1 crmd: [1662]: info: process_lrm_event: LRM
> > operation p_drbd_r0:0_monitor_15000 (call=26, rc=7, cib-update=32,
> > confirmed=false) not running
> > Feb 24 20:55:55 drbdnode1 attrd: [1661]: notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > fail-count-p_drbd_r0:0 (1)
> >
> > --------------------------------------
> >
> > does anyone have a clue why this might happen?
> > It only seems to happen when drbd runs primary on nodeA, though this
> > node is designed to always be primary as long as it's online...
> >
> > thanks
> > Christoph Roethlisberger
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed