[DRBD-user] Cluster toggle with no reasons

Lars Ellenberg lars.ellenberg at linbit.com
Fri May 6 10:31:13 CEST 2016


On Wed, May 04, 2016 at 01:36:57PM +0000, benjamin.linier at engie.com wrote:
> Hello,
> 
> I have over 10 sites in production with exactly the same standard M/S installation:
> 
>   2 nodes (server1 and server2)
>   Debian 3.2.68-1+deb7u2 x86_64 GNU/Linux
>   DRBD 8.3.11 (api:88/proto:86-96)
>   Corosync 1.4.2-3

That's all old software.
Not in the sense of "stable", but in the sense of "old".

> All sites are identical because we deploy them with an automated installation DVD built with SimpleCDD.
> 
> We have a serious problem on one site: sometimes the MASTER node switches from server1 to server2 for no apparent reason, and then switches back to server1. Sometimes the system toggles 2 or 3 times before returning to its normal state.
> 
> This issue is not periodic. Sometimes it happens after 2 months of stability, and sometimes only 15 days after the previous occurrence.
> 
> This situation is critical because the toggle can corrupt data, which shows up as MySQL tables marked as crashed (and our software stops).
> 
> Could you help determine the possible root causes of the cluster becoming unstable?
> 
> I first suspected the LAN, but I ran some tests in the lab: when we induce errors on the LAN, the log shows something like "conn( WFConnection -> NetworkFailure )". That is not the case at the production site, so the LAN seems to be OK.
> 
> Here are the production logs for server1 and server2:

What you showed there is only an excerpt of the DRBD kernel logs,
and it only shows that
  someone "downs" a resource,
  later someone "ups" that resource again,
  and DRBD does a resync.

  All normal operation, from DRBD's point of view.
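
To make that down / up / resync sequence easier to follow in a long kernel log, the state changes can be pulled out mechanically. This is a minimal sketch, assuming every state field uses the same `name( Old -> New )` notation as the `conn( WFConnection -> NetworkFailure )` message quoted earlier in this thread. The function name and the sample line are made up for illustration.

```python
import re

# Matches one "field( Old -> New )" group,
# e.g. "conn( WFConnection -> NetworkFailure )".
TRANSITION_RE = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def parse_transitions(line):
    """Return a list of (field, old_state, new_state) tuples found in one log line."""
    return TRANSITION_RE.findall(line)


# Hypothetical sample line, built around the message quoted in this thread.
sample = "block drbd0: conn( WFConnection -> NetworkFailure )"
for field, old, new in parse_transitions(sample):
    print(f"{field}: {old} -> {new}")
```

Running this over the full log of each node, in order, gives a compact list of transitions that can be compared against the times at which the toggles were noticed.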

If you did not intend for this to happen, you need to figure out
who or what downed the resource, and why.

Most likely some other admin -- you should ask around --,
or your cluster manager, in which case there should be plenty of logs
about what triggered that decision...
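
One rough way to start looking: pull the DRBD kernel lines and the cluster manager lines out of the same syslog and read them interleaved, so that whatever was logged just before the resource went down becomes visible. The sketch below is only that, a sketch. The daemon names (`crmd`, `pengine`, `lrmd`, `attrd`, `cib`, `corosync`) are the usual ones for a Pacemaker/Corosync stack of that generation and may not match your installation, and it assumes both kinds of messages end up in one file in chronological order. None of this is taken from the logs that were posted.

```python
import re

# Assumption: DRBD messages arrive via the kernel facility and mention "drbd".
DRBD_RE = re.compile(r"\bkernel:.*\bdrbd", re.IGNORECASE)

# Assumption: typical Pacemaker/Corosync daemon tags; adjust to your setup.
CLUSTER_RE = re.compile(r"\b(crmd|pengine|lrmd|corosync|attrd|cib)\b(\[\d+\])?:")


def classify(line):
    """Return "drbd", "cluster", or None for an unrelated log line."""
    if DRBD_RE.search(line):
        return "drbd"
    if CLUSTER_RE.search(line):
        return "cluster"
    return None


def interesting_lines(lines):
    """Yield (kind, line) for every DRBD or cluster manager line, in input order."""
    for line in lines:
        kind = classify(line)
        if kind is not None:
            yield kind, line.rstrip("\n")


def scan_file(path):
    """Print the relevant lines of one log file, tagged by origin."""
    with open(path, errors="replace") as fh:
        for kind, line in interesting_lines(fh):
            print(f"[{kind:7}] {line}")
```

Calling `scan_file()` on the syslog of each node (the path depends on your syslog configuration) and looking at the cluster lines immediately before each DRBD "down" is a reasonable first pass. If there are no cluster manager lines at all around those times, that points back to a human or a script.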


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed


