[DRBD-user] ocf:linbit:drbd: DRBD Split-Brain not detected in non standard setup

Dr. Volker Jaenisch volker.jaenisch at inqbus.de
Fri Feb 24 15:08:04 CET 2017

Hi DRBD users!

We have a DRBD 8.4 setup with two nodes connected over long-distance
lines, with one not-so-common feature we would like to discuss here.

Host A ========== 2 x 10GBit lines ========  Host B

   | --------------- DSL 2Mb/s Line ------------|

The nodes are connected via two redundant 10 Gbit links which are
port-channeled/bonded into an interface bond0 on each node. Let's call
this bond the "worker" connection.

Pacemaker (corosync) features two rings: one over bond0 (the worker
connection) and a second over an additional link (a peer-to-peer DSL
line at 2 Mbit/s). Let's call this DSL link the "heartbeat" connection.
This additional heartbeat line is the non-standard part.
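
For illustration, the two rings could look roughly like this in
corosync.conf (corosync 2.x totem syntax; the network addresses are
placeholders and rrp_mode passive is an assumption):

    totem {
        version: 2
        rrp_mode: passive
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.10.0   # ring 0: bond0, the worker connection
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.20.0   # ring 1: the DSL heartbeat line
            mcastport: 5407
        }
    }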

The configurations of corosync, Pacemaker, and DRBD are textbook and in
no way special.


While testing disaster scenarios we discovered the following corner case:

If both 10 Gbit links fail, then bond0, a.k.a. the worker connection,
fails and DRBD goes - as expected - into split brain. But that is not
the problem.
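
The broken replication link is easy to verify by hand on either node
(DRBD 8.4; any connection state other than "Connected", e.g.
WFConnection or StandAlone, means the worker link is gone):

    # connection state of all configured resources
    drbdadm cstate all

    # or look at the kernel's view directly
    cat /proc/drbd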

The problem is that Pacemaker obviously does not notice the split
brain. Since the second ring, the "heartbeat" connection, is still
active, both hosts can share their knowledge of the cluster. As a
result, both Pacemaker nodes remain in their current state. This is the
expected as well as the desired behavior...

... but would it not be better if Pacemaker were aware of the no longer
functional DRBD? Should it not at least complain?
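
At least DRBD itself can be told to complain: DRBD 8.4 supports a
split-brain handler in the resource configuration, and drbd-utils ships
a notification script for it (the resource name r0 is a placeholder):

    resource r0 {
        handlers {
            # mails the given account when DRBD detects a split brain
            split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        }
    }

This does not make Pacemaker any wiser, but at least somebody gets
notified.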

It is clear that a split brain cannot and should not be dealt with
automagically, so Pacemaker cannot solve this condition by itself. But
it may lead some admins to terribly wrong decisions if they look only
at the Pacemaker status - since that status seems completely OK on both
nodes. One such decision would be initiating a failover.

* Did we make some stupid error, or is this behavior intended?

* One way to deal with this situation is clearly to monitor DRBD with
Nagios, and we will do this (see the check sketch after this list).

* Another way would be to include the DSL line in the bond. But due to
the very different nature of these lines, e.g. in terms of latency,
this would be a delicate setup. Surely one can prioritize the
individual links in the bond and define active and passive members (see
the bonding sketch below this list). Additionally, this setup would
violate one of our guiding principles in high availability: *No admin
issuing a single command should be able to produce a split brain*. In
the case of bonding all three lines, this command would be "ifdown
bond0". Now you understand why we have the DSL line at all :-). For
setups that are not geographically separated we use additional serial
lines to achieve our guiding principle.
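
As for the Nagios monitoring mentioned above, a minimal sketch of such
a check could be a plugin that parses /proc/drbd (hand-rolled, not an
official plugin; the state strings are those of DRBD 8.4):

    #!/bin/sh
    # Nagios-style check: OK only if every DRBD device is Connected
    if grep -q 'cs:' /proc/drbd \
       && ! grep 'cs:' /proc/drbd | grep -qv 'cs:Connected'; then
        echo "OK - all DRBD resources Connected"
        exit 0
    else
        echo "CRITICAL - at least one DRBD resource not Connected"
        exit 2
    fi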
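
And for completeness, prioritizing links within a bond would look
roughly like this: active-backup mode with a preferred primary slave
(again ifenslave syntax; the interface names, in particular dsl0 for
the DSL line, are illustrative assumptions):

    # fragment of an /etc/network/interfaces bond0 stanza
    bond-mode active-backup
    # prefer one of the 10 Gbit links as long as it is up
    bond-primary eth0
    # the DSL line as an additional, normally passive member
    bond-slaves eth0 eth1 dsl0
    bond-miimon 100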

Looking forward to the discussion.

Cheers,

Volker

-- 
=========================================================
   inqbus Scientific Computing    Dr.  Volker Jaenisch
   Richard-Strauss-Straße 1       +49(08861) 690 474 0
   86956 Schongau-West            http://www.inqbus.de
=========================================================




