Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Mar 01, 2012 at 11:27:01AM +0100, Lars Ellenberg wrote:
> On Mon, Feb 27, 2012 at 05:15:29PM +0100, Christoph Roethlisberger wrote:
> > We use a simple 2-node active-passive cluster with DRBD and NFS services.
> >
> > Right now the cluster monitor detects a drbd failure every couple
> > of hours (~2-40) and fails over.
>
>
> Oh... I may have missed this context, and focused too much on the error
> log below.
>
> So you *do* have a working DRBD,
> and only the monitor operation fails "occasionally" (much too often,
> still), with the below error log.
>
> Did I understand correctly this time?
>
> > syslog shows the following lines just before pacemaker initiates the
> > failover:
> >
> > --------------------------------------
> > Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> > (p_drbd_r0:0:monitor:stderr) <1>error creating netlink socket
That error message above is triggered only if
- calloc(1, sizeof(struct genl_sock)) fails
  (very unlikely; that's only a few bytes, and if you were that hard
  out of memory, you would know...)
- s->s_fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC);
  fails, or
- any of
  err = setsockopt(s->s_fd, SOL_SOCKET, SO_SNDBUF, &bsz, sizeof(bsz)) ||
        setsockopt(s->s_fd, SOL_SOCKET, SO_RCVBUF, &bsz, sizeof(bsz)) ||
        bind(s->s_fd, (struct sockaddr*) &s->s_local, sizeof(s->s_local));
  fails.
None of which is likely to fail "only occasionally"; one quick check
is sketched below.
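If it is the socket() call that fails intermittently, one plausible
(though unconfirmed) cause would be file descriptor exhaustion in the
process running the monitor. Assuming that process is lrmd (the pidof
lookup is just a sketch, adjust to your setup), a quick check:

# file descriptors currently open in lrmd
ls /proc/$(pidof lrmd)/fd | wc -l
# versus the per-process limit
grep 'Max open files' /proc/$(pidof lrmd)/limits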
You could run a tight loop:
i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done;
echo "failed after $i calls"
If that fails after some time, you could repeat it under strace:
i=0; while strace -x -s 1024 -o /tmp/whatever.strace.out drbdsetup 0 dstate >/dev/null ; do let i++; done;
echo "failed after $i calls"
Now you should have an strace of the failed run in that file,
which we could analyse...
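To spot the failing call in that trace yourself: strace marks failed
syscalls with a return value of -1 followed by the errno name, so
grepping the output file from above should point right at the culprit:

grep -n ' = -1 E' /tmp/whatever.strace.out | tail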
> > Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> > (p_drbd_r0:0:monitor:stderr) Could not connect to 'drbd' generic
> > netlink family
> > Feb 24 20:55:54 drbdnode1 crmd: [1662]: info: process_lrm_event: LRM
> > operation p_drbd_r0:0_monitor_15000 (call=26, rc=7, cib-update=32,
> > confirmed=false) not running
> > Feb 24 20:55:55 drbdnode1 attrd: [1661]: notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > fail-count-p_drbd_r0:0 (1)
> >
> > --------------------------------------
> >
> > does anyone have a clue why this might happen?
> > It only seems to happen when drbd runs primary on nodeA, though this
> > node is designed to always be primary as long as it's online...
> >
> > thanks
> > Christoph Roethlisberger
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed