Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Mar 01, 2012 at 11:27:01AM +0100, Lars Ellenberg wrote:
> On Mon, Feb 27, 2012 at 05:15:29PM +0100, Christoph Roethlisberger wrote:
> > We use a simple 2-node active-passive cluster with DRBD and NFS services.
> >
> > Right now the cluster monitor detects a DRBD failure every couple of
> > hours (~2-40) and will fail over.
>
> Oh... I may have missed this context, and focused too much on the error
> log below.
>
> So you *do* have a working DRBD,
> and only the monitor operation fails "occasionally" (much too often,
> still), with the below error log.
>
> Did I understand correctly this time?
>
> > syslog shows the following lines just before Pacemaker initiates the
> > failover:
> >
> > --------------------------------------
> > Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> > (p_drbd_r0:0:monitor:stderr) <1>error creating netlink socket

That error message above is triggered only if

 - calloc(1, sizeof(struct genl_sock))
   fails. Very unlikely: that's only a few bytes; you would have to be
   so badly out of memory that you would know...

 - s->s_fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC);
   fails.

 - any of
   err = setsockopt(s->s_fd, SOL_SOCKET, SO_SNDBUF, &bsz, sizeof(bsz))
      || setsockopt(s->s_fd, SOL_SOCKET, SO_RCVBUF, &bsz, sizeof(bsz))
      || bind(s->s_fd, (struct sockaddr*) &s->s_local, sizeof(s->s_local));
   fails.

None of which is likely to fail "only occasionally".

You could run a tight loop:

  i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo "failed after $i calls"

If that fails after some time, you could repeat it with

  i=0; while strace -x -s 1024 -o /tmp/whatever.strace.out \
    drbdsetup 0 dstate >/dev/null ; do let i++; done
  echo "failed after $i calls"

Now you should have an strace of the failed run in that file,
which we could analyse...

> > Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> > (p_drbd_r0:0:monitor:stderr) Could not connect to 'drbd' generic
> > netlink family
> > Feb 24 20:55:54 drbdnode1 crmd: [1662]: info: process_lrm_event: LRM
> > operation p_drbd_r0:0_monitor_15000 (call=26, rc=7, cib-update=32,
> > confirmed=false) not running
> > Feb 24 20:55:55 drbdnode1 attrd: [1661]: notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > fail-count-p_drbd_r0:0 (1)
> >
> > --------------------------------------
> >
> > Does anyone have a clue why this might happen?
> > It only seems to happen when DRBD runs primary on nodeA, though this
> > node is designed to always be primary as long as it's online...
> >
> > thanks
> > Christoph Roethlisberger

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

__
please don't Cc me, but send to list -- I'm subscribed
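
To narrow down which of the failure points listed above is actually
hit, a small standalone test program can repeat the same call sequence
and report errno on the first failure. The sketch below is not DRBD's
actual libgenl code: the buffer size (bsz) and the sockaddr_nl setup
are assumptions, the calloc() step is omitted (as noted above, a
few-byte allocation failing would indicate much bigger problems), and
the loop simply mirrors the shell tight loop at the C level.

  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <linux/netlink.h>

  /* Try the socket/setsockopt/bind sequence once; return 0 on success. */
  static int try_genl_socket(void)
  {
      struct sockaddr_nl local;
      int bsz = 1024 * 1024;  /* assumed buffer size, not DRBD's value */
      int fd;

      fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC);
      if (fd < 0) {
          fprintf(stderr, "socket: %s\n", strerror(errno));
          return -1;
      }

      memset(&local, 0, sizeof(local));
      local.nl_family = AF_NETLINK;

      if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bsz, sizeof(bsz))
          || setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bsz, sizeof(bsz))
          || bind(fd, (struct sockaddr *)&local, sizeof(local))) {
          fprintf(stderr, "setsockopt/bind: %s\n", strerror(errno));
          close(fd);
          return -1;
      }

      close(fd);
      return 0;
  }

  int main(void)
  {
      unsigned long i = 0;

      /* Loop until the sequence fails, like the shell tight loop above. */
      while (try_genl_socket() == 0)
          i++;
      printf("failed after %lu successful calls\n", i);
      return 1;
  }

Built with e.g. "gcc -o genl_test genl_test.c", this should loop
indefinitely on a healthy node; if the intermittent monitor failure
comes from one of the three calls above, it should eventually print
the failing step and its errno.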
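When reading the resulting strace output, the failing system call
appears with its errno spelled out; searching the end of the trace for
"= -1" usually finds the culprit quickly. For example (the specific
errno shown here is hypothetical, and older strace versions may render
the constants numerically), a failing socket() call would show up
roughly as:

  socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC) = -1 EMFILE (Too many open files)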