On 13.06.2012 17:56, William Seligman wrote:
> A data point:
> On my cluster, I have two dedicated direct-link cables between the two nodes,
> one for DRBD traffic, the other for corosync/pacemaker traffic. Roughly once per
> week, I get a "link down" messages on one of the nodes:

A) Use several communication rings in corosync. We use one on the
regular user network and a second on the storage network. If one fails,
no problem: corosync sees no need to fence anything.
B) Use bonded/bridged interfaces for the storage connection. Our
storage network (vlan17) is currently carried as a tagged vlan on eth0
of all the servers and untagged on eth1. On top of that we use a bond in
active-backup mode where eth1 is the primary and vlan17 the backup.
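
For A), a minimal sketch of what the totem section of corosync.conf
could look like with two rings. This assumes corosync 1.x with
multicast. All addresses and ports are placeholders, not our real
values, and rrp_mode passive is just one of the possible choices:

```
totem {
    version: 2
    # redundant ring protocol: "passive" or "active"
    rrp_mode: passive

    # ring 0: regular user network (placeholder subnet)
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }

    # ring 1: storage network (placeholder subnet)
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.17.0
        mcastaddr: 239.255.2.1
        mcastport: 5407
    }
}
```

Each ring gets its own multicast address and port. corosync-cfgtool -s
should then show the status of both rings.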

With these two in place I didn't even notice when my boss unplugged the
network cables of one of our servers one by one. DRBD didn't see any
glitch, and the cluster felt no need to move/kill/fence anything.
And a 5 second hang for the x2go sessions on one of the machines doesn't
matter when everyone is on break.
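
The bond from B) could be written roughly like this. It is an untested
sketch that assumes Debian-style ifupdown with the ifenslave and vlan
packages installed. The address is a placeholder, and the interface
ordering may need adjusting on a real system:

```
# /etc/network/interfaces fragment (assumed distro layout)

# vlan17 tagged on eth0, used as the backup slave
iface eth0.17 inet manual
    vlan-raw-device eth0

# storage bond: eth1 (untagged) preferred, eth0.17 as fallback
auto bond0
iface bond0 inet static
    address 192.168.17.11
    netmask 255.255.255.0
    bond-slaves eth1 eth0.17
    bond-mode active-backup
    bond-primary eth1
    bond-miimon 100
```

With bond-primary set, the bond switches back to eth1 as soon as its
link returns. /proc/net/bonding/bond0 shows which slave is currently
active.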

I haven't yet figured out how to build the bridges/bonds once all the
servers have 4 nics. But that isn't a real problem until I have also
done functionality tests with two (or three) new switches.
I think I will do one bridge of two ports with rstp for the normal user
network and one bridge of two ports with rstp for the storage-network.
Then skip the active-backup bonding and see whether rstp manages to find
the paths. Of course this wouldn't necessarily improve throughput
between two nodes, but throughput from one node to two nodes would
probably be higher.
Or I extend my current setup and, instead of eth0 and eth1, use one pair
of bonded ports each. That would give me a total of three bonds per
server: two in one of the 'real' modes and one in active-backup mode...

Well, let's see.

