[DRBD-user] Corosync Configuration

Thu Jun 14 02:02:37 CEST 2012

On 06/13/2012 05:56 PM, William Seligman wrote:
> On 6/13/12 11:45 AM, Arnold Krille wrote:
>> On Wednesday 13 June 2012 09:26:45 Felix Frank wrote:
>>> On 06/12/2012 08:23 PM, Dennis Jacobfeuerborn wrote:
>>>>> Don't use crossover cables. In my experience, using crossover cables for a
>>>>> two-node cluster only causes problems. Use a simple switch.
>>>>
>>>> Why would a setup with 2 cables and a switch be more reliable than just a
>>>> single cable? That doesn't make sense.
>>>
>>> Uhm, I don't think that was what Eduardo was suggesting.
>>>
>>> Someone on this list (Digimer?) made a good point some time ago about
>>> switches allowing for better forensics in case of link problems (i.e.,
>>> the switch can help you identify the side with a faulty NIC/cable).
>>>
>>> On the other hand, a switch introduces one more (two really, counting
>>> the extra required cable) possible point of replication failure. I've
>>> never had negative experiences with back-to-back connections either.
>>
>> A switch also has input- and output-buffers introducing another step of 
>> latency.
>>
>> We would use a direct link-cable if that scaled for more than 2 servers. (We
>> actually thought about just adding more network-cards, connect three servers
>> with three cables directly and use bridges with (r)stp for the storage ring. 
>> But now we will just use the additional cards for more redundancy to have 
>> trunked connections to two switches...)
> 
> A data point:
> 
> On my cluster, I have two dedicated direct-link cables between the two nodes,
> one for DRBD traffic, the other for corosync/pacemaker traffic. Roughly once per
> week, I get "link down" messages on one of the nodes:
> 
> Jun 12 09:39:33 orestes-corosync kernel: igb: eth1 NIC Link is Down
> Jun 12 09:39:33 orestes-corosync kernel: igb: eth3 NIC Link is Down
> 
> The cluster responds by STONITHing (rebooting) the other node. Everything comes
> up fine, and all the resources continue to be available (though some VMs get
> rebooted, which is mildly annoying).
> 
> eth0 and eth2, which are connected to switches, don't have this problem; eth1 is
> on the motherboard while eth3 is on an expansion card; only one node has the
> error. This makes it difficult to diagnose.
> 
> It's not a big deal, but it does contribute to the idea that perhaps using an
> intermediate switch increases the chance of a reliable connection, at the
> obvious cost of an additional mode of failure.

With how many systems have you tested this, i.e. how large was your sample
size? An anecdote cannot be generalized: you might just as well have started
off with a switch, run into problems, and fixed them by using a direct connection.

In order to prefer one case over the other in general one has to provide
sound reasoning *why* exactly one case is inherently better.
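
As a first step toward something more than an anecdote, one could at least
tally the link-down events per host and interface across all nodes and compare
direct links with switched ones. A rough sketch, assuming the syslog line
format of the two igb messages quoted above (other drivers log differently);
the function name, the pattern, and the third sample line are made up for
illustration:

```python
import re
from collections import Counter

# Pattern modeled on the two quoted igb messages only; this is an assumption,
# not a general description of kernel link-state logging.
LINK_DOWN = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s(?P<host>\S+)\skernel:\s"
    r"\w+:\s(?P<iface>\w+)\sNIC Link is Down"
)


def count_link_down(lines):
    """Count link-down events per (host, interface) pair."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN.match(line)
        if match:
            counts[(match.group("host"), match.group("iface"))] += 1
    return counts


sample = [
    "Jun 12 09:39:33 orestes-corosync kernel: igb: eth1 NIC Link is Down",
    "Jun 12 09:39:33 orestes-corosync kernel: igb: eth3 NIC Link is Down",
    # Hypothetical non-matching line, included to show that it is ignored.
    "Jun 12 09:39:36 orestes-corosync kernel: igb: eth1 NIC Link is Up",
]

for (host, iface), n in sorted(count_link_down(sample).items()):
    print(host, iface, n)
```

Counts like these, collected over many clusters and a long enough period,
would be a start, but they still would not explain *why* one setup fails more
often than the other.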

Regards,
  Dennis