[DRBD-user] [PATCH] dopd should notify when peer is dead (was "Refusing to be Primary while peer is not outdated" when peer is dead?)

Wed Feb 27 17:00:48 CET 2008

Hi again,

I now have a better understanding of the issue I posted to drbd-user
yesterday. It seems to be a bug in dopd/drbd-peer-outdater, hence I'm
posting this mail to the dev list.

Note that dopd works fine in the other failover scenario (the other node
is still alive and can be contacted by other means).

On Tue, 2008-02-26 at 17:50 +0100, Brice Figureau wrote:
> I'm doing some failover tests of a passive/active mysql over drbd
> configuration.
> The current setup uses drbd 8.0.8 with heartbeat 2.1.3 (V2 crm style,
> drbddisk RA).
> 
> One of my failover scenarios involves unplugging the AC power of the
> current active node to see if the passive node is promoted.
> Unfortunately for me, it fails when the soon-to-become-active node
> tries to promote drbd to primary mode.

Here is the failing scenario:

* I unplug the master

the slave gets:
drbd0: PingAck did not arrive in time.
drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure )
pdsk( UpToDate -> DUnknown )
drbd0: asender terminated
drbd0: Terminating asender thread
drbd0: short read expecting header on sock: r=-512
drbd0: Writing meta data super block now.
drbd0: tl_clear()
drbd0: Connection closed
drbd0: conn( NetworkFailure -> Unconnected )
drbd0: receiver terminated
drbd0: receiver (re)started
drbd0: conn( Unconnected -> WFConnection )

* The slave's drbd notices it immediately and launches the outdate-peer
helper:
 drbd0: helper command: /sbin/drbdadm outdate-peer

* Which launches "/usr/lib/heartbeat/drbd-peer-outdater", which in turn
contacts dopd with the peer's name and resource.

* Dopd connects to the crm only to see that the node is completely dead
(since it has been abruptly shut down). Dopd then returns 20 to the
client (see line 311 of dopd.c).

* drbd-peer-outdater gets the 20 and aborts.

* The drbd module gets the 20 return code and thinks drbd-peer-outdater
is broken. Thus it doesn't mark the peer as outdated.

* Meanwhile, heartbeat notices it has to start the resources on the
slave, the soon-to-be primary node.

* Unfortunately that operation fails: "drbdsetup /dev/drbd0 primary"
is rejected with the "Refusing to be Primary while peer is not
outdated" error message.
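
To make the chain above concrete, here is a rough sketch of how the
helper's exit code ends up blocking the promotion. It is only a model of
the behaviour described in this mail, not code from the drbd module: the
enum and function names are made up for illustration, and the meanings of
codes 3, 4 and 7 are quoted from memory of the DRBD 8.0 documentation, so
treat them as assumptions. Only 5 and 20 come from the scenario above.

```c
#include <stdbool.h>

/* Hypothetical model of the behaviour described in this mail; these
 * names are illustrative and this is not code from the drbd module. */
enum peer_disk { PDSK_INCONSISTENT, PDSK_OUTDATED, PDSK_UNKNOWN };

/* Map the outdate-peer helper's exit code to the peer disk state.
 * Codes 3, 4 and 7 are listed from memory of the DRBD 8.0 docs
 * (assumption); 5 and 20 are the values discussed in this mail.
 * Code 6 (peer is Primary) is left out of this sketch. */
enum peer_disk pdsk_from_helper_exit(int exit_code)
{
    switch (exit_code) {
    case 3:  return PDSK_INCONSISTENT; /* peer disk is inconsistent */
    case 4:  return PDSK_OUTDATED;     /* peer was outdated */
    case 5:  return PDSK_OUTDATED;     /* peer unreachable, assumed dead */
    case 7:  return PDSK_OUTDATED;     /* peer was fenced (stonith) */
    default: return PDSK_UNKNOWN;      /* e.g. 20: helper looks broken */
    }
}

/* With fencing enabled, promotion is refused while the peer's disk
 * state is still unknown ("Refusing to be Primary while peer is not
 * outdated"). */
bool may_become_primary(enum peer_disk pdsk)
{
    return pdsk != PDSK_UNKNOWN;
}
```

With this model, an exit code of 20 leaves the peer state unknown and
may_become_primary() returns false, while an exit code of 5 would let the
promotion go through.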

So what's the point of having a highly available cluster that can't
survive the death of one node?

I think dopd.c should handle dead peers specially and return 5 in that
case (so that drbd knows the other node is dead).
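
As a minimal sketch of that idea (not the attached patch itself), the
return code could be chosen from the peer's cluster status. The function
name and the constants' names are hypothetical, and I am assuming the
status is available as a string with "dead" marking a dead node; only the
values 5 and 20 come from the discussion above.

```c
#include <stddef.h>
#include <string.h>

/* Return codes discussed in this mail. */
enum { RC_PEER_DEAD = 5, RC_ERROR = 20 };

/* Hypothetical helper: pick the code to send back to the client when
 * the outdate request cannot be delivered to the peer.
 * Assumption: peer_status is the cluster's status string for the peer,
 * with "dead" meaning the node is known to be down. */
int rc_for_undelivered_outdate(const char *peer_status)
{
    if (peer_status != NULL && strcmp(peer_status, "dead") == 0)
        return RC_PEER_DEAD; /* tell drbd the peer is dead */
    return RC_ERROR;         /* anything else stays a generic failure */
}
```

Only a peer that the cluster positively reports as dead gets the 5; an
unknown or missing status still yields 20, so drbd keeps refusing to
promote when nothing is known about the peer.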

Also, the check_drbd_peer() function in dopd.c looks highly suspect: it
stops looping before it finds the matching node if there are some dead
nodes in between...
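
For illustration, a node walk that does not trip over unrelated dead
nodes might look like the sketch below. It deliberately uses a plain
array instead of the heartbeat API, and the struct, enum and function
names are invented for this example; it is not the real
check_drbd_peer() nor the attached patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one entry of the cluster node list. */
struct node {
    const char *name;
    const char *status; /* assumption: "dead" marks a dead node */
};

enum peer_lookup { PEER_NOT_FOUND, PEER_ALIVE, PEER_DEAD };

/* Walk the whole list and look only at the status of the node whose
 * name matches the drbd peer. Other nodes are skipped whether they
 * are dead or alive. */
enum peer_lookup find_peer(const struct node *nodes, size_t count,
                           const char *peer)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(nodes[i].name, peer) != 0)
            continue; /* not our peer: keep walking */
        return strcmp(nodes[i].status, "dead") == 0 ? PEER_DEAD
                                                    : PEER_ALIVE;
    }
    return PEER_NOT_FOUND;
}
```

The point of the design is that a dead node earlier in the list no longer
ends the walk, and a dead peer is reported as such instead of being
lumped together with "peer not found".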

So here is a dopd patch fixing this issue. I have only lightly tested it
with the above scenario (with good results), so use at your own risk,
etc...
I'm not sure if I should send the patch to drbd-dev or to the linux-ha
lists, so I'm trying here first (the problem is drbd related). If I'm
wrong, please let me know and I'll send this mail to the linux-ha dev
list.

Many thanks,
-- 
Brice Figureau <brice+drbd at daysofwonder.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dopd.patch
Type: text/x-patch
Size: 2844 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20080227/5d970c2a/attachment.bin>