[DRBD-user] ocf:linbit:drbd: DRBD Split-Brain not detected in non standard setup

Lars Ellenberg lars.ellenberg at linbit.com
Sun Feb 26 19:39:56 CET 2017

On Sat, Feb 25, 2017 at 06:03:00PM +0100, Dr. Volker Jaenisch wrote:
> Hi All!
> OMG I am so stupid!
> * Feb 25 16:53:27 mail2 kernel: [11901363.518368] drbd r0: bind before
> listen failed, err = -99
> Without an interface to bind to drbd cannot listen!
> @All: Never do a network failure test with DRBD by shutting down
> interfaces in your linux box. Use iptables or pull the cords or shut
> down your cisco ports.

(Or realize the implications)
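
For reference, the "err = -99" in that kernel message is a negated errno. A minimal Python sketch for decoding it follows; the helper name parse_kernel_err is made up for illustration, and the log line is the one quoted above, rejoined onto one line. On Linux, errno 99 is EADDRNOTAVAIL: with the interface down, the address DRBD is configured to bind to no longer exists on the box.

```python
import errno
import os
import re

# Kernel log line quoted in this thread, rejoined onto one line.
LOG_LINE = ("Feb 25 16:53:27 mail2 kernel: [11901363.518368] "
            "drbd r0: bind before listen failed, err = -99")


def parse_kernel_err(line):
    """Return the positive errno from an 'err = -N' message, or None.

    Hypothetical helper, not part of DRBD or its tooling.
    """
    match = re.search(r"err = -(\d+)", line)
    return int(match.group(1)) if match else None


code = parse_kernel_err(LOG_LINE)
# errno numbers are platform specific, so look the name up at runtime
# instead of hard-coding it. On Linux, 99 maps to EADDRNOTAVAIL.
print(code, errno.errorcode.get(code, "unknown"), os.strerror(code))
```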

> So only one thing still stands out:
> > But this is not what we like to happen in this case. In the case of
> > communication breakdown of DRBD but still a connection between the
> > corosync nodes, we would like the cluster nodes :
> > 1) to remain in their state,
> > 2) prevent DRBD from failover,
> > 3) Indicate that the DRBD connection is broken
> > 4) wait for reestablishing of the connection and resync the drbd after,
> > 5) allow failover again.
> From this fairly long wishlist in our case:
> 1) Works
> 2) Works (Rule prevents failover)
> 3) Works not

From the Pacemaker PoV, all is good...

Pacemaker won't indicate that the 3rd disk
of your 8-disk RAID 6 is broken, either.

But yes, I have argued your point of view myself before.

crm_mon -L will list negative location constraints
("xyz prevents foo from running on bla").
DRBD and complaints like yours were the reason I implemented that.
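
To make "negative location constraint" concrete, here is a rough Python sketch that picks such constraints out of a constraints section of the CIB (for instance what `cibadmin -Q -o constraints` prints). It is not how crm_mon does it; the sample XML, the constraint ids, the resource name ms_drbd_r0 and the node name mail1 are all invented for illustration, and "negative" is simplified to "score starts with a minus sign".

```python
import xml.etree.ElementTree as ET

# Hypothetical constraints section; ids, resource and node names are made up.
SAMPLE_CONSTRAINTS = """
<constraints>
  <rsc_location id="ban-drbd-on-mail1" rsc="ms_drbd_r0"
                node="mail1" score="-INFINITY"/>
  <rsc_location id="prefer-mail2" rsc="ms_drbd_r0"
                node="mail2" score="100"/>
  <rsc_location id="drbd-needs-link" rsc="ms_drbd_r0">
    <rule id="drbd-needs-link-rule" role="Master" score="-INFINITY">
      <expression id="drbd-needs-link-expr" attribute="#uname"
                  operation="ne" value="mail2"/>
    </rule>
  </rsc_location>
</constraints>
"""


def negative_locations(xml_text):
    """Return (constraint id, resource) pairs that carry a negative score.

    The score may sit on the rsc_location itself or on a nested rule.
    """
    root = ET.fromstring(xml_text)
    found = []
    for loc in root.iter("rsc_location"):
        scores = [loc.get("score")]
        scores += [rule.get("score") for rule in loc.iter("rule")]
        if any(s is not None and s.startswith("-") for s in scores):
            found.append((loc.get("id"), loc.get("rsc")))
    return found


for cid, rsc in negative_locations(SAMPLE_CONSTRAINTS):
    print(f"{cid} restricts where {rsc} may run")
```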

Pacemaker also learned about "degraded, but OK" monitoring
status return codes (OCF_DEGRADED, OCF_DEGRADED_MASTER),
whose sole intended purpose is to "allow crm_mon to flash warning lights"
(or to inform any other consumer of the status section).
The main reason for introducing these was, again, DRBD,
and complaints like yours.
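
Purely as an illustration of the idea, a monitor action could map DRBD role and connection state to these codes roughly as in the Python sketch below. Assumptions: the numeric values 190 and 191 are the ones I believe Pacemaker's headers define for the degraded codes, the function name monitor_rc is hypothetical, and the list of "not connected" states is deliberately incomplete.

```python
# Classic OCF monitor results.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8
# "Degraded, but OK" codes; values assumed from Pacemaker's headers.
OCF_DEGRADED = 190
OCF_DEGRADED_MASTER = 191

# Simplified, non-exhaustive set of DRBD connection states that mean
# "no working replication link right now".
DISCONNECTED_STATES = {"StandAlone", "WFConnection", "Connecting",
                       "Unconnected", "NetworkFailure", "Timeout"}


def monitor_rc(role, cstate):
    """Hypothetical mapping from DRBD state to a monitor exit code."""
    if role not in ("Primary", "Secondary"):
        return OCF_NOT_RUNNING
    degraded = cstate in DISCONNECTED_STATES
    if role == "Primary":
        return OCF_DEGRADED_MASTER if degraded else OCF_RUNNING_MASTER
    return OCF_DEGRADED if degraded else OCF_SUCCESS


print(monitor_rc("Primary", "WFConnection"))
```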

We just don't use those exit codes in our RA yet.

Nor does anyone else, afaik. It has not even made it into
ocf-returncodes in resource-agents yet.

Probably because they don't work yet:
even though CRMD and the policy engine and later consumers
would treat them OK-ish (I think),
LRMD chooses to "filter" exit codes in ocf2uniform(),
"just in case", so they will never make it back to CRMD.

I may have forgotten to file an issue there :-/
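
To illustrate what such filtering does to the new codes, here is a small Python model. It is not the LRMD source: it assumes the filter collapses anything outside the classic OCF range (0 through 9) into a generic error, and the function names are hypothetical. The second function shows what a degraded-aware variant could look like.

```python
OCF_ERR_GENERIC = 1
OCF_FAILED_MASTER = 9      # highest of the classic OCF exit codes
OCF_DEGRADED = 190         # values assumed from Pacemaker's headers
OCF_DEGRADED_MASTER = 191


def filter_rc(rc):
    """Assumed behaviour: unknown codes become a generic error."""
    if rc < 0 or rc > OCF_FAILED_MASTER:
        return OCF_ERR_GENERIC
    return rc


def filter_rc_degraded_aware(rc):
    """Hypothetical variant that lets the degraded codes through."""
    if rc in (OCF_DEGRADED, OCF_DEGRADED_MASTER):
        return rc
    return filter_rc(rc)


# With the plain filter, a "degraded master" result reaches the caller
# as a generic error instead of "OK, but flash a warning light".
print(filter_rc(OCF_DEGRADED_MASTER), filter_rc_degraded_aware(OCF_DEGRADED_MASTER))
```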

> 4) Works
> 5) Works
> I will open another thread for this last issue.
> Many thanks for all of you. Sorry for stealing your time


: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
please don't Cc me, but send to list -- I'm subscribed
