Hi again,

I now have a better understanding of the issue I posted to drbd-user yesterday. It seems to be a bug in dopd/drbd-peer-updater, hence I'm posting this mail to the dev list.

Note that dopd works fine in the other failover scenario (the other node is still alive and can be contacted by other means).

On Tue, 2008-02-26 at 17:50 +0100, Brice Figureau wrote:
> I'm doing some failover tests of a passive/active mysql over drbd
> configuration.
> The current setup uses drbd 8.0.8 with heartbeat 2.1.3 (V2 crm style,
> drbddisk RA).
>
> One of my failover scenarios involves unplugging the AC power of the
> current active node to see if the passive node is promoted.
> Unfortunately for me it fails when the soon-to-become-active node
> tries to promote drbd to primary mode.

Here is the failing scenario:

* I unplug the master. The slave gets:

    drbd0: PingAck did not arrive in time.
    drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
    drbd0: asender terminated
    drbd0: Terminating asender thread
    drbd0: short read expecting header on sock: r=-512
    drbd0: Writing meta data super block now.
    drbd0: tl_clear()
    drbd0: Connection closed
    drbd0: conn( NetworkFailure -> Unconnected )
    drbd0: receiver terminated
    drbd0: receiver (re)started
    drbd0: conn( Unconnected -> WFConnection )

* The slave's drbd notices it immediately and launches the outdate-peer helper:

    drbd0: helper command: /sbin/drbdadm outdate-peer

* Which launches "/usr/lib/heartbeat/drbd-peer-updater", which in turn contacts dopd with the peer's name and resource.
* Dopd connects to the crm only to see that the node is completely dead (since it has been abruptly shut down). Dopd then returns 20 to the client (see line 311 of dopd.c).
* drbd-peer-updater gets the 20 and aborts.
* The drbd module gets the 20 return code and thinks drbd-peer-updater is broken. Thus it doesn't mark the peer as outdated.
* Meanwhile, heartbeat notices it has to start the resources on the slave, the soon-to-be-primary node.
* Unfortunately that operation fails: "drbdsetup /dev/drbd0 primary" fails with the "Refusing to be Primary while peer is not outdated" error message.

So what's the point of having a highly available cluster that can't survive the death of one node?

I think there should be special handling of dead peers in dopd.c that returns 5 (so that drbd knows the other node is dead).

Also, the check_drbd_peer() function in dopd.c seems highly suspect: it won't keep looping until it finds a matching node if there are dead nodes in between.

So here is a dopd patch fixing this issue. I only lightly tested it with the above scenario (with good results), so use at your own risk, etc.

I'm not sure whether I should send the patch to drbd-dev or to the linux-ha lists, so I'm trying here first (the problem is drbd related). If I'm wrong, please let me know and I'll send this mail to the linux-ha dev list.

Many thanks,
--
Brice Figureau <brice+drbd at daysofwonder.com>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dopd.patch
Type: text/x-patch
Size: 2844 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20080227/5d970c2a/attachment.bin>
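
For readers who cannot fetch the attachment, the two changes proposed in the mail (scan the whole node list instead of giving up at a dead node, and report a dead peer with 5 instead of 20) can be sketched roughly as below. This is not the attached patch and not the real dopd.c code: it is an illustrative Python model, and the names Node, find_peer and outdate_peer_status are hypothetical. Only the values 5 and 20 come from the mail; returning 20 for an unknown peer and None for a live peer are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

PEER_DEAD = 5       # from the mail: tells drbd the other node is dead
HELPER_ERROR = 20   # from the mail: what dopd returned in the failing scenario


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node as seen by dopd."""
    name: str
    alive: bool


def find_peer(nodes: List[Node], peer_name: str) -> Optional[Node]:
    """Scan the whole list; a dead non-matching node must not end the search."""
    for node in nodes:
        if node.name == peer_name:
            return node
    return None


def outdate_peer_status(nodes: List[Node], peer_name: str) -> Optional[int]:
    """Return an exit status, or None when the normal outdate path should run."""
    peer = find_peer(nodes, peer_name)
    if peer is None:
        # Assumption: an unknown peer is still treated as a helper error.
        return HELPER_ERROR
    if not peer.alive:
        # Proposed special case: a dead peer is reported as dead, not as an error.
        return PEER_DEAD
    # Live peer: ask it to outdate itself (not modelled in this sketch).
    return None


if __name__ == "__main__":
    cluster = [Node("old-node", alive=False), Node("master", alive=False)]
    print(outdate_peer_status(cluster, "master"))
```

With the master unplugged, the sketch reports 5 even though another dead node precedes it in the list, which is the behaviour the mail asks for so that the surviving node can be promoted.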