[DRBD-user] Corosync Configuration

Thu Jun 14 04:32:18 CEST 2012

On 6/13/12 8:02 PM, Dennis Jacobfeuerborn wrote:
> On 06/13/2012 05:56 PM, William Seligman wrote:
>> On 6/13/12 11:45 AM, Arnold Krille wrote:
>>> On Wednesday 13 June 2012 09:26:45 Felix Frank wrote:
>>>> On 06/12/2012 08:23 PM, Dennis Jacobfeuerborn wrote:
>>>>>> Don't use crossover cables. In my experience, crossover cables in a
>>>>>> two-node cluster only cause problems. Use a simple switch.
>>>>>
>>>>> Why would a setup with 2 cables and a switch be more reliable than just a
>>>>> single cable? That doesn't make sense.
>>>>
>>>> Uhm, I don't think that was what Eduardo was suggesting.
>>>>
>>>> Someone on this list (Digimer?) made a good point some time ago about
>>>> switches allowing for better forensics in case of link problems (i.e.,
>>>> the switch can help you identify the side with a faulty NIC/cable).
>>>>
>>>> On the other hand, a switch introduces one more (two really, counting
>>>> the extra required cable) possible point of replication failure. I've
>>>> never had negative experiences with back-to-back connections either.
>>>
>>> A switch also has input- and output-buffers introducing another step of 
>>> latency.
>>>
>>> We would use a direct link-cable if that scaled for more than 2 servers. (We
>>> actually thought about just adding more network-cards, connect three servers
>>> with three cables directly and use bridges with (r)stp for the storage ring. 
>>> But now we will just use the additional cards for more redundancy to have 
>>> trunked connections to two switches...)
>>
>> A data point:
>>
>> On my cluster, I have two dedicated direct-link cables between the two nodes,
>> one for DRBD traffic, the other for corosync/pacemaker traffic. Roughly once per
>> week, I get "link down" messages on one of the nodes:
>>
>> Jun 12 09:39:33 orestes-corosync kernel: igb: eth1 NIC Link is Down
>> Jun 12 09:39:33 orestes-corosync kernel: igb: eth3 NIC Link is Down
>>
>> The cluster responds by STONITHing (rebooting) the other node. Everything comes
>> up fine, and all the resources continue to be available (though some VMs get
>> rebooted, which is mildly annoying).
>>
>> eth0 and eth2, which are connected to switches, don't have this problem; eth1 is
>> on the motherboard while eth3 is on an expansion card; only one node has the
>> error. This makes it difficult to diagnose.
>>
>> It's not a big deal, but it does contribute to the idea that perhaps using an
>> intermediate switch increases the chance of a reliable connection, at the
>> obvious cost of an additional mode of failure.
> 
> With how many systems have you tested this, i.e. how large was your sample
> size? An anecdote cannot be generalized; you might just as well have started
> off with a switch, run into problems, and fixed them by using a direct
> connection.
> 
> In order to prefer one case over the other in general one has to provide
> sound reasoning *why* exactly one case is inherently better.

You are correct; my sample size is one, which proves nothing. I offered
the anecdote so that other folks can add it to any other anecdotes
upthread, downthread, or elsewhere and perhaps come to a general
conclusion.
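
For anyone who wants to turn anecdotes like this into counts that can be
compared across clusters, here is a minimal sketch in Python that tallies
"NIC Link is Down" events per host and interface. The regular expression is
an assumption based only on the two log lines quoted above; other drivers or
kernel versions may format the message differently, and the function name
count_link_down is purely illustrative.

```python
import re
from collections import Counter

# Assumed line format, taken from the two sample lines in this thread:
#   Jun 12 09:39:33 orestes-corosync kernel: igb: eth1 NIC Link is Down
LINK_DOWN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<driver>\w+):\s+(?P<iface>\w+)\s+NIC Link is Down"
)


def count_link_down(lines):
    """Return a Counter keyed by (host, interface) of link-down events."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN.match(line)
        if match:
            counts[(match.group("host"), match.group("iface"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jun 12 09:39:33 orestes-corosync kernel: igb: eth1 NIC Link is Down",
        "Jun 12 09:39:33 orestes-corosync kernel: igb: eth3 NIC Link is Down",
    ]
    for (host, iface), n in sorted(count_link_down(sample).items()):
        print(f"{host} {iface}: {n}")
```

To run it over a real log, pass an open file object instead of the sample
list; the path to that log depends on the distribution and is not something
this sketch assumes.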

-- 
Bill Seligman             | mailto://seligman@nevis.columbia.edu
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137                |
Irvington NY 10533  USA   | Phone: (914) 591-2823
