[DRBD-user] drbd with heartbeat won't fail over

Mon Jun 18 16:07:08 CEST 2007

stonith won't help with what we're trying to test.

using both communication channels doesn't seem to solve this issue either.

we want heartbeat to fail drbd over if the main (or "public") interface is
down.
the machine may still be operational, but the network could be down, for any
number of reasons.

if we monitor both communication channels as you say, it'll never fail over,
because the cross-over cable for the drbd data never fails.

we currently have heartbeat monitoring the "public" ip of the other server.
we can't have the drbd data on the same interface, because that could end
up being too much traffic and would limit our bandwidth to do any real work.

monitoring the cross-over (drbd) link is equally pointless. drbd manages
that itself quite well.
if we disable heartbeat for drbd monitoring, then we lack the ability to
umount and remount the partitions if servers fail.

it seems that in this instance heartbeat can only be used for a full server
failure (lost power).

Dan.

On 6/18/07, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
>
> On Thu, Jun 14, 2007 at 01:54:58PM -0400, Dan Gahlinger wrote:
> > Lars,
> >
> > we found the "fly in the ointment". we know why it fails, but have no idea
> > how to fix it.
> > Take this basic setup:
> >
> > test1 and test2 running drbd and heartbeat.
> > drbd is on a cross-over cable between the two.
> > heartbeat is on the public interface.
> > test1 is primary (for the sake of argument)
> >
>
> for heartbeat to do its heartbeats you should use every communication
> channel available. you should definitely use the drbd replication link
> as heartbeat comm channel, too.
>
> > unplug the public ethernet interface from test1.
> > Nothing changes.
> >
> > test2 cannot become primary. it is impossible.
> > test1 is already primary.
> > drbd connection is active.
> >
> > heartbeat attempts to run a drbddisk r0 start on test2
> > which is physically impossible, because drbd is already running.
> > test2 never gets the virtual ip resource (though I'm not sure why).
> > the debug log says "success" but it doesn't actually do it.
> > running the command manually for the virtual ip works ok though.
> >
> > heartbeat would need to do the following for this to work properly:
> > 1. don't attempt to start drbd - this will never work
> > 2. do an unmount of the drbd filesystems on test1
> > 3. do a drbdadm secondary on test1
> > 4. then do a drbdadm primary all on test2
> >
> > I'm not even sure this is possible.
>
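for reference, the manual switchover order listed above can be sketched as a small dry-run planner. this is only an illustration, not something heartbeat or drbd ships: the function name, the device /dev/drbd0 and the mount point /mnt/data are made up for the example, the resource is named explicitly (r0) instead of using "all", and the final mount on the new primary is an assumed extra step. it only builds the command list and never runs anything.

```python
from typing import List, Tuple

# One step = (host that should run it, command as an argv list).
Step = Tuple[str, List[str]]


def switchover_plan(old_primary: str, new_primary: str, resource: str,
                    device: str, mount_point: str) -> List[Step]:
    """Return the ordered steps for a manual DRBD switchover (dry run only).

    The order matters: the old primary must release the filesystem and
    demote itself before the other node can be promoted.
    """
    return [
        (old_primary, ["umount", mount_point]),               # release the filesystem
        (old_primary, ["drbdadm", "secondary", resource]),    # demote old primary
        (new_primary, ["drbdadm", "primary", resource]),      # promote new primary
        (new_primary, ["mount", device, mount_point]),        # assumed extra step
    ]


if __name__ == "__main__":
    # Hypothetical device and mount point; adjust to the real setup.
    for host, argv in switchover_plan("test1", "test2", "r0",
                                      "/dev/drbd0", "/mnt/data"):
        print(f"{host}: {' '.join(argv)}")
```

the point of keeping it as a plan is that the ordering can be checked before anything touches a live resource.
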
> there are several options:
> * stonith
>    when heartbeat detects one box to be dead,
>    it would switch it off using a power switch,
>    just to be sure -- because it might not be dead, after all,
>    it may be "only" a complete loss of communications...
>
> * when you have multiple comm channels, and want to trigger a
>    switchover of services when the outside connectivity on the active
>    node breaks, the concept of "ping nodes" or groups thereof helps.
>
>    choose your ping nodes (and timeouts!) wisely to match your situation
>    (network and outside connectivity). ping nodes should be highly
>    available themselves (choose the upstream router/switch combo,
>    choose the first hop of the provider network, something like that).
>    otherwise you could get spurious failover/failback.
>
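to make the ping-node idea concrete, here is a rough sketch of the decision it boils down to: each node counts how many ping nodes it can still reach, and the service moves only when the standby sees strictly more of them than the active node. as far as I know heartbeat does this itself through "ping" entries in ha.cf together with its ipfail helper, so this is not heartbeat's code, just the logic as I understand it. the function name and the counts are invented for the example.

```python
from typing import Dict


def should_fail_over(active: str, standby: str,
                     reachable: Dict[str, int]) -> bool:
    """Return True if the standby reaches strictly more ping nodes.

    `reachable` maps a node name to the number of ping nodes it can
    currently reach. A tie never triggers a move, which avoids bouncing
    the service when a ping node itself goes down for both nodes.
    """
    return reachable.get(standby, 0) > reachable.get(active, 0)


if __name__ == "__main__":
    # test1 has lost its public interface; the cross-over link is still up,
    # so both nodes are alive and can compare their counts.
    print(should_fail_over("test1", "test2", {"test1": 0, "test2": 1}))
    # The ping node itself is down: neither node sees it, nothing moves.
    print(should_fail_over("test1", "test2", {"test1": 0, "test2": 0}))
```

this is also why the cross-over link matters as a second heartbeat channel: the nodes need a working path between them to compare what each one can see.
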
> --
> : Lars Ellenberg                            Tel +43-1-8178292-0  :
> : LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
> : Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
> __
> please use the "List-Reply" function of your email client.
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>