Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Mon, Feb 27, 2012 at 05:15:29PM +0100, Christoph Roethlisberger wrote:
> We use a simple two-node active-passive cluster with DRBD and NFS services.
>
> Right now the cluster monitor detects a DRBD failure every couple of
> hours (~2-40) and will fail over.

Oh... I may have missed this context, and focused too much on the error
log below. So you *do* have a working DRBD, and only the monitor
operation fails "occasionally" (much too often, still), with the below
error log. Did I understand correctly this time?

> syslog shows the following lines just before Pacemaker initiates the
> failover:
>
> --------------------------------------
> Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> (p_drbd_r0:0:monitor:stderr) <1>error creating netlink socket
> Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> (p_drbd_r0:0:monitor:stderr) Could not connect to 'drbd' generic
> netlink family
> Feb 24 20:55:54 drbdnode1 crmd: [1662]: info: process_lrm_event: LRM
> operation p_drbd_r0:0_monitor_15000 (call=26, rc=7, cib-update=32,
> confirmed=false) not running
> Feb 24 20:55:55 drbdnode1 attrd: [1661]: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-p_drbd_r0:0 (1)
> --------------------------------------
>
> Does anyone have a clue why this might happen?
> It only seems to happen when DRBD runs primary on nodeA, though this
> node is designed to always be primary as long as it's online...
>
> thanks
> Christoph Roethlisberger

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
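
For reference, the two stderr lines in the log correspond to the two
steps a DRBD userland command that talks to the kernel over generic
netlink performs before doing anything else: create a NETLINK_GENERIC
socket, then resolve the family registered under the name "drbd".
Below is a minimal, self-contained C sketch of just those two steps;
it is an illustration under those assumptions, not the actual
drbdsetup source. The error strings intentionally mirror the log.

--------------------------------------
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

int main(void)
{
    /* Step 1 -- failure here is what shows up as
     * "error creating netlink socket", e.g. when the calling
     * process has run out of file descriptors. */
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
    if (fd < 0) {
        perror("error creating netlink socket");
        return 1;
    }

    /* Build a CTRL_CMD_GETFAMILY request asking the generic netlink
     * controller for the family registered under the name "drbd". */
    struct {
        struct nlmsghdr   nh;
        struct genlmsghdr gh;
        char attrs[NLA_HDRLEN + 8]; /* room for the name attribute */
    } req;
    memset(&req, 0, sizeof(req));

    req.nh.nlmsg_type  = GENL_ID_CTRL;
    req.nh.nlmsg_flags = NLM_F_REQUEST;
    req.gh.cmd         = CTRL_CMD_GETFAMILY;
    req.gh.version     = 1;

    struct nlattr *na = (struct nlattr *)req.attrs;
    na->nla_type = CTRL_ATTR_FAMILY_NAME;
    na->nla_len  = NLA_HDRLEN + sizeof("drbd");
    memcpy((char *)na + NLA_HDRLEN, "drbd", sizeof("drbd"));
    req.nh.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN + NLA_ALIGN(na->nla_len));

    struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
    if (sendto(fd, &req, req.nh.nlmsg_len, 0,
               (struct sockaddr *)&kernel, sizeof(kernel)) < 0) {
        perror("Could not connect to 'drbd' generic netlink family");
        close(fd);
        return 1;
    }

    char buf[4096];
    ssize_t len = recv(fd, buf, sizeof(buf), 0);
    struct nlmsghdr *nh = (struct nlmsghdr *)buf;

    /* Step 2 -- an NLMSG_ERROR reply (e.g. -ENOENT because the drbd
     * module is not loaded or not yet registered) is what the tool
     * reports as the second message in the log. */
    if (len < (ssize_t)sizeof(*nh) || nh->nlmsg_type == NLMSG_ERROR) {
        fprintf(stderr,
                "Could not connect to 'drbd' generic netlink family\n");
        close(fd);
        return 1;
    }

    printf("'drbd' generic netlink family is registered\n");
    close(fd);
    return 0;
}
--------------------------------------

When either step fails inside the resource agent's monitor action, the
agent cannot query DRBD state and reports the resource as not running;
that is the rc=7 (OCF_NOT_RUNNING) in the crmd line, which bumps the
fail-count and triggers the failover.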