[DRBD-user] Promote secondary to primary fails in some situations

Felix Frank ff at mpexnet.de
Tue Aug 28 10:43:50 CEST 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

this is old stuff, so I hope it's still relevant. Some remarks inline.

On 08/06/2012 08:39 PM, Nik Martin wrote:
> 2 Storage Servers, named SAN-n1 and SAN-n2, with the following config:
> ...
>     on san01-n1 {

Uhm? The hostname here (san01-n1) doesn't match the server names you
gave above (SAN-n1/SAN-n2). Make sure these match what the nodes are
actually called.

> First, my CIB:

It's a pacemaker config, but all the better!

> primitive ip_mgmt ocf:heartbeat:IPaddr2 \
>     params ip="172.16.5.10" cidr_netmask="24" \
>     op monitor interval="10s"

Hmm, wouldn't you want to specify which NICs these should be bound to?
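Something along these lines, maybe (nic="eth0" is just a placeholder,
use whatever your management interface is actually called):

    primitive ip_mgmt ocf:heartbeat:IPaddr2 \
        params ip="172.16.5.10" cidr_netmask="24" nic="eth0" \
        op monitor interval="10s"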

> res_iSCSITarget_p_iscsitarget res_iSCSILogicalUnit_1 \
>     meta target-role="started"

I'm somewhat unfamiliar with this syntax.

I usually avoid target roles, though. On occasion, I have seen pacemaker
refuse to fail over because it would not stop such resources first, even
though stopping them was required for the failover.

This might be your problem.
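If you want to try without it, something like this should drop the meta
attribute (a rough sketch; I'm guessing the resource it lives on is the
rg_vmstore group, since your snippet is cut off):

    crm resource meta rg_vmstore delete target-role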

> location cli-standby-rg_vmstore rg_vmstore \
>     rule $id="cli-standby-rule-rg_vmstore" -inf: #uname eq san01-n2

You really, really want to unmigrate this resource if you want failover
to work at all. That cli-standby constraint looks like a leftover from an
earlier "crm resource migrate", and with a -inf score it pins rg_vmstore
away from san01-n2 permanently. This is definitely a problem.
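Something like one of these should clear it:

    crm resource unmigrate rg_vmstore
    # or remove the constraint by its id:
    crm configure delete cli-standby-rg_vmstore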

All in all, your config looks fine, though.


> The failure mode which is not being handled properly by this
> configuration is the failure of the storage network interface on the
> Primary (relative to DRBD) node. DRBD communicates over the xover
> connection, so it remains in state Connected:Primary/Secondary if the
> network interface on the primary server dies.  What I see on the
> secondary unit is an error promoting drbd to primary, stating that
> there can only be one primary.  So my question is how do I demote the
> primary to secondary when this failure mode occurs on either SAN Node? I
> don't see any demotion logic in the DRBD resource agent.

There is definitely demotion logic.

I suspect parts of your configuration (see above) prevent the CRM on the
failed node from stopping DRBD here. Please clean those up, then repeat
the test.

In addition: if you want failover to occur when your services are no
longer reachable, you may need to add ping resources. They tell pacemaker
on your SAN nodes that connectivity has been lost (broken cables, dead
switch ports etc.), even though the IPs are still configured locally.
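For reference, a minimal sketch of what I mean (the gateway address and
the resource names are made up, adapt them to your network):

    primitive p_ping ocf:pacemaker:ping \
        params host_list="172.16.5.1" multiplier="1000" dampen="5s" \
        op monitor interval="10s"
    clone cl_ping p_ping
    location loc_on_connected_node rg_vmstore \
        rule -inf: not_defined pingd or pingd lte 0

With that in place, a node that loses connectivity to the gateway gets a
-inf score for the group, so pacemaker should move it (and, via your
colocation constraints, the DRBD master role) to the peer.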

HTH,
Felix


