Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Mon, Feb 27, 2012 at 05:15:29PM +0100, Christoph Roethlisberger wrote:
> We use a simple two-node active-passive cluster with DRBD and NFS services.
>
> Right now the cluster monitor detects a DRBD failure every couple of
> hours (~2-40) and will fail over.

Oh... I may have missed this context, and focused too much on the error
log below. So you *do* have a working DRBD, and only the monitor
operation fails "occasionally" (much too often, still), with the below
error log. Did I understand correctly this time?

> syslog shows the following lines just before Pacemaker initiates the
> failover:
>
> --------------------------------------
> Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> (p_drbd_r0:0:monitor:stderr) <1>error creating netlink socket
> Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> (p_drbd_r0:0:monitor:stderr) Could not connect to 'drbd' generic
> netlink family
> Feb 24 20:55:54 drbdnode1 crmd: [1662]: info: process_lrm_event: LRM
> operation p_drbd_r0:0_monitor_15000 (call=26, rc=7, cib-update=32,
> confirmed=false) not running
> Feb 24 20:55:55 drbdnode1 attrd: [1661]: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-p_drbd_r0:0 (1)
> --------------------------------------
>
> Does anyone have a clue why this might happen?
> It only seems to happen when DRBD runs primary on nodeA, though this
> node is designed to always be primary as long as it's online...
>
> thanks
> Christoph Roethlisberger

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
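
For reference, the two stderr lines in the log correspond to the two
steps a DRBD userland command that talks to the kernel over generic
netlink performs before doing anything else: create a NETLINK_GENERIC
socket, then resolve the family registered under the name "drbd".
Below is a minimal, self-contained C sketch of just those two steps;
it is an illustration under those assumptions, not the actual
drbdsetup source. The error strings intentionally mirror the log.

--------------------------------------
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

int main(void)
{
    /* Step 1 -- failure here is what shows up as
     * "error creating netlink socket", e.g. when the calling
     * process has run out of file descriptors. */
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
    if (fd < 0) {
        perror("error creating netlink socket");
        return 1;
    }

    /* Build a CTRL_CMD_GETFAMILY request asking the generic netlink
     * controller for the family registered under the name "drbd". */
    struct {
        struct nlmsghdr   nh;
        struct genlmsghdr gh;
        char attrs[NLA_HDRLEN + 8]; /* room for the name attribute */
    } req;
    memset(&req, 0, sizeof(req));

    req.nh.nlmsg_type  = GENL_ID_CTRL;
    req.nh.nlmsg_flags = NLM_F_REQUEST;
    req.gh.cmd         = CTRL_CMD_GETFAMILY;
    req.gh.version     = 1;

    struct nlattr *na = (struct nlattr *)req.attrs;
    na->nla_type = CTRL_ATTR_FAMILY_NAME;
    na->nla_len  = NLA_HDRLEN + sizeof("drbd");
    memcpy((char *)na + NLA_HDRLEN, "drbd", sizeof("drbd"));
    req.nh.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN + NLA_ALIGN(na->nla_len));

    struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
    if (sendto(fd, &req, req.nh.nlmsg_len, 0,
               (struct sockaddr *)&kernel, sizeof(kernel)) < 0) {
        perror("Could not connect to 'drbd' generic netlink family");
        close(fd);
        return 1;
    }

    char buf[4096];
    ssize_t len = recv(fd, buf, sizeof(buf), 0);
    struct nlmsghdr *nh = (struct nlmsghdr *)buf;

    /* Step 2 -- an NLMSG_ERROR reply (e.g. -ENOENT because the drbd
     * module is not loaded or not yet registered) is what the tool
     * reports as the second message in the log. */
    if (len < (ssize_t)sizeof(*nh) || nh->nlmsg_type == NLMSG_ERROR) {
        fprintf(stderr,
                "Could not connect to 'drbd' generic netlink family\n");
        close(fd);
        return 1;
    }

    printf("'drbd' generic netlink family is registered\n");
    close(fd);
    return 0;
}
--------------------------------------

When either step fails inside the resource agent's monitor action, the
agent cannot query DRBD state and reports the resource as not running;
that is the rc=7 (OCF_NOT_RUNNING) in the crmd line, which bumps the
fail-count and triggers the failover.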