[DRBD-user] DRBD: failover when sync connection dies?

Martin Gombac martin at isg.si
Wed Dec 19 15:58:33 CET 2007



On 2007.12.18, at 15:52, Lars Ellenberg wrote:

> On Mon, Dec 17, 2007 at 10:55:02AM +0100, Martin Gombac wrote:
>> On 2007.12.13, at 18:24, Martin Gombac wrote:
>>
>>> On 2007.12.13, at 17:42, Florian Haas wrote:
>>>
>>>> On Thursday 13 December 2007 13:45:47 Martin Gombac wrote:
>>>>> Hi,
>>>>>
>>>>> .....
>>>>> without synchronized drbd resources (amongst other things).
>>>>>
>>>>> My question is this:
>>>>> How can i make one node take over all resources if local crossover
>
> you don't want to.
But I do. :-)
See below why.

>
> if your lan connection dies,
> and your lan connection was your replication link,
> then you don't have replication anymore,
> and so you would go online with non-current data.
A couple of seconds old data would come up on the second node, I agree.
But that is way better than the scenario described below:
Each node holds a secondary/slave drbd resource that is now out of sync.
Since my replication link probably died due to a broken network card, I
have to take the node with the broken card down. In that case the second
(healthy) node would come up with _really_ out-of-date data for the
second resource, or not at all if I used dopd. That means we don't
have a true fail-over cluster and would have unnecessary downtime.
But we use clustering in the first place to avoid downtime.
(I also have other applications communicating over this link, but can
easily make them use the WAN.)
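
To put rough numbers on that comparison, here is a minimal Python sketch.
It is only an illustration: the function name and the timings (5 seconds
until fail-over, 1800 seconds until the broken node is shut down) are
assumptions, not measurements, and it does not model DRBD itself.

```python
from typing import Optional


def data_age_on_takeover(link_failed_at: float, takeover_at: float,
                         peer_outdated: bool = False) -> Optional[float]:
    """Seconds by which the surviving node's copy lags when it takes over.

    Returns None when the copy was marked outdated (as dopd would do),
    which models a resource that refuses to be promoted at all.
    """
    if takeover_at < link_failed_at:
        raise ValueError("takeover cannot precede the link failure")
    if peer_outdated:
        return None
    return takeover_at - link_failed_at


# Hypothetical timings, with the replication link dying at t = 0 seconds.
immediate = data_age_on_takeover(0, 5)      # fail over right away
delayed = data_age_on_takeover(0, 1800)     # broken node shut down after 30 min
with_dopd = data_age_on_takeover(0, 1800, peer_outdated=True)

print(immediate, delayed, with_dopd)
```

Under these assumed timings the healthy node serves data that is seconds
old in the first case, half an hour old in the second, and nothing at
all in the third.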

>
> if currently your LAN connection is a direct "crossover cable",
> why would you think any clients would benefit from failing over?
>
There would be very little downtime (if it fails over to the healthy
node), no outdated drbd resources (as opposed to one outdated if it
doesn't fail over), and as soon as I fix the network card,
synchronization would be auto-magic. Losing a couple of seconds of data
is in my opinion much better than having at least half an hour of
downtime when I shut down the broken server.

> if you change to a switched LAN, and add a ping node,
> why do you think any clients would benefit from that?
>
We would know on which node the network failed and on which it works,
so fail-over would go in the right direction. Later on I would fix
the node with the broken network card (or whatever), plug it back in,
and it would rejoin the cluster (sync and all). Clients would benefit by
not having the downtime they would have with dopd.

> how can you be sure what component failed,
>  local NIC, cables, remote NIC, switch, driver, ...?
>
If the switch or ping node failed, both nodes would detect that and
no fail-over would happen. I would replace the switch and
synchronization would start auto-magically. If the failure were in a
local network interface, cable or driver, one node would still get the
pings back, so we would know on which side it failed, and the healthy
node would take over thanks to heartbeat.
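
The ping-node reasoning above can be sketched as a small decision
function. This is a hypothetical illustration in Python, not heartbeat's
actual implementation: it assumes each node can still learn the peer's
ping status over a second heartbeat path (for example the WAN link), and
all names are made up.

```python
from enum import Enum


class Action(Enum):
    STAY = "no fail-over"
    TAKE_OVER = "take over all resources"
    RELEASE = "release resources to the peer"


def decide(local_sees_ping: bool, peer_sees_ping: bool) -> Action:
    """Decide what the local node should do after the sync link is lost."""
    if local_sees_ping and not peer_sees_ping:
        # Only the peer lost the ping node: its NIC, cable or driver failed.
        return Action.TAKE_OVER
    if peer_sees_ping and not local_sees_ping:
        # Only we lost the ping node: the fault is on our side.
        return Action.RELEASE
    # Both see it or neither does (switch or ping node down): do nothing.
    return Action.STAY


for local, peer in [(True, False), (False, True), (True, True), (False, False)]:
    print(local, peer, decide(local, peer).value)
```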

> what problem are you trying to solve?
>   I mean not "failing over when the LAN link dies".
>   please zoom out a little.
>
To have as little downtime as possible if the replication link fails. I
think I have explained this a couple of times by now. If data gets
outdated on one node and the other has to be taken down for repairs
=> service offline.

> from my point of view, it makes no sense to trigger a failover
> because the replication link dies. it would even be harmful.
> so don't do that.
>
What should I do then? Just take one node offline at a time (and
half of the services with it) and take my time to repair it? Why do I
even bother setting up a cluster then?

Until now I have set up all my clusters the way you describe, with no
fail-over when the sync link dies. But recently one of my clusters had
the problem I described (a broken network card on one sync node) and
it led to downtime. My client now states that, since fail-over didn't
happen in this case of a network card failure, I have _not_ set them
up a fail-over cluster. I guess we all know how annoying clients can be
at times. So for future installs and the one I'm doing now I want it
to be bulletproof. :-)

> -- 
> : Lars Ellenberg                           http://www.linbit.com :
> : DRBD/HA support and consulting             sales at linbit.com :
> : LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
> : Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
> __
> please use the "List-Reply" function of your email client.
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user



