[DRBD-user] drbd with heartbeat won't fail over

Mon Jun 18 08:29:12 CEST 2007

On Thu, Jun 14, 2007 at 01:54:58PM -0400, Dan Gahlinger wrote:
> Lars,
> 
> we found the "fly in the ointment". we know why it fails, but no idea how to
> fix it.
> Take this basic setup:
> 
> test1 and test2 running drbd and heartbeat.
> drbd is on a cross-over cable between the two.
> heartbeat is on the public interface.
> test1 is primary (for the sake of argument)
> 

for heartbeat to do its heartbeats you should use every communication
channel available. you should definitely use the drbd replication link
as a heartbeat comm channel, too.
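
a minimal sketch of what that could look like in ha.cf, assuming the
public interface is eth0 and the cross-over link is eth1 (both interface
names are assumptions, adjust to your setup):

```
# /etc/ha.d/ha.cf (fragment)
# interface names below are assumptions, not taken from your setup

# heartbeat over the public interface
bcast eth0
# heartbeat over the drbd cross-over link as well
bcast eth1

# node names must match `uname -n` on each box
node test1
node test2
```

with both links in use, unplugging the public interface alone no longer
looks like a dead peer, because the nodes still see each other over the
cross-over link.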

> unplug the public ethernet interface from test1.
> Nothing changes.
> 
> test2 cannot become primary. it is impossible.
> test1 is already primary.
> drbd connection is active.
> 
> heartbeat attempts to run a drbddisk r0 start on test2
> which is physically impossible, because drbd is already running.
> test2 never gets the virtual ip resource (though I'm not sure why).
> the debug log says "success" but it doesn't actually do it.
> running the command manually for the virtual ip works ok though.
> 
> heartbeat would need to do the following for this to work properly:
> 1. don't attempt to start drbd - this will never work
> 2. do an unmount of the drbd filesystems on test1
> 3. do a drbdadm secondary on test1
> 4. then do a drbdadm primary all on test2
> 
> I'm not even sure this is possible.

there are several options:
 * stonith
   when heartbeat detects one box to be dead,
   it would switch it off using a power switch,
   just to be sure -- because it might not be dead, after all,
   it may be "only" a complete loss of communications...

 * when you have multiple comm channels, and want to trigger a
   switchover of services when the outside connectivity on the active
   node breaks, the concept of "ping nodes" or groups thereof helps.

   choose your ping nodes (and timeouts!) wisely to match your situation
   (network and outside connectivity). ping nodes should be highly
   available themselves (choose the upstream router/switch combo,
   choose the first hop of the provider network, something like that).
   otherwise you could get spurious failover/failback.
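
a rough ha.cf sketch of the ping node approach, assuming a heartbeat
1.x-style configuration with the ipfail helper. the address and the
timeout value are placeholders, and the ipfail path may differ on your
distribution:

```
# /etc/ha.d/ha.cf (fragment)
# 192.0.2.1 is a placeholder for a highly available upstream router
ping 192.0.2.1

# seconds without a reply before the ping node is considered dead
# (illustrative value, tune to your network)
deadping 30

# ipfail compares ping node reachability between the cluster nodes
# and requests a switchover when the peer has better connectivity
# (path is an assumption; some installs use /usr/lib64/heartbeat/ipfail)
respawn hacluster /usr/lib/heartbeat/ipfail
```

this only helps if the cluster nodes can still talk to each other over
some other channel, such as the replication link.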

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :