Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wed, Feb 18, 2009 at 03:44:15PM -0700, David.Livingstone at cn.ca wrote:
> Lars,
>
> Thanks for the reply. See below.
>
> > On Tue, Feb 17, 2009 at 03:52:16PM -0700, David.Livingstone at cn.ca wrote:
> > > Hello,
> > >
> > > I currently have two two-node clusters running heartbeat and
> > > drbd (see background below). I also have a two-node test cluster
> > > which I decided to update to the latest releases of everything.
> > > In doing so I downloaded and installed drbd 8.3.0
> > > (drbd-8.3.0.tar.gz), which supports three-node setups using
> > > stacked resources. Specifically, having a third backup/brp node
> > > geographically removed from our production cluster is very
> > > appealing.
> > >
> > > I have looked at the online manual (http://www.drbd.org/users-guide/)
> > > and read the current information for three-node setups, and have
> > > some observations/questions:
> > > - An illustration/figure of a three-node setup would help.
> >
> > there are several ways to do it.
> > you can also have four nodes: two two-node DRBDs, the primary of
> > which is the "lower" resource of a "stacked" DRBD.
>
> Are there some examples I can review somewhere?

I'll try to make you an ascii art picture so you can better see what
should be happening:

                                 ,---- replicated to 3rd
|--------- FILE SYSTEM /data ---|/[*]
|--------- drbd1 - on 1st ------|- drbd1-metadata|
|--------- drbd0 - on 1st -----------------------| <==> replicated to 2nd node

Alpha and Bravo may change roles at any time (for failover/switchover);
node Charlie will connect to the respective "Primary" of those. Because
of that, [*] has to be the "floating ip" of the drbd cluster, which you
wrote in your config file as the ip in the stacked-on-top-of section.

for data to be replicated to all three nodes, it has to go through both
drbd1, which replicates to the 3rd node, and drbd0 (where drbd1 passes
its local writes to), which replicates to the local respective "other"
node. thus our naming of "upper" and "lower" drbd: they are stacked.

as soon as drbd1 is active, you can no longer mount drbd0 (and you
should not, either, see above). you need to mount the "upper" drbd.

I suggest having _two_ "floating cluster ips", one for the drbd
replication link to the 3rd node, and one for any cluster services
clients may connect to, as they may be on different interfaces/network
segments:

 ,------------------ local cluster ---------------------------------.
 |  ,----- Alpha -----.                          ,----- Bravo -----. |
 |  |                 |- fixed-ip <==> fixed-ip -|                 | |
 |  |                 |                          |                 | |
 |  `-----------------´                          `-----------------´ |
 |      [cluster ip 1]                                               |
 |                                   [cluster ip 2]                  |
 `------------------ local cluster ---------------------------------´

clients would connect to [cluster ip 1], Charlie would connect its drbd
to [cluster ip 2].

the drbd on Charlie may well be an "upper" drbd stacked on a "lower"
drbd on Charlie, which would then replicate further to Delta. in that
case, you'd have a floating ip on Charlie and Delta as well, and the
upper DRBD would replicate from either (Alpha or Bravo) to either
(Charlie or Delta), or vice versa, depending on which node is "active"
for the respective lower drbd.

if it is all local networks, you can have them all managed within one
pacemaker cluster, and the currently active "upper" DRBD (and resources
using it) may then be moved freely among all four nodes. if you get the
constraints right, that is. use four such DRBD, and add preferential
constraints, so they would be equally distributed in normal operation.
you'd use protocol C (or maybe B) throughout, and do fully automatic
pacemaker controlled failovers.

if the replication link between both clusters is a WAN, you probably do
NOT want them all within one pacemaker cluster. you'd use protocol A
(and potentially the DRBD Proxy; contact LINBIT) on that link, and
would do only semi-automatic, operator confirmed, failover between
sites.
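for concreteness, here is a rough, untested sketch of what the two
resource definitions for the pictures above could look like in
drbd.conf (all resource names, host names, ips, ports and backing
disks are made up; see the stacked example in the users guide for the
authoritative version):

    resource data-lower {
      protocol C;                  # synchronous, within the local cluster

      on alpha {
        device    /dev/drbd0;
        disk      /dev/sda7;       # made-up backing disk
        address   10.0.0.1:7788;   # alpha's fixed ip
        meta-disk internal;
      }
      on bravo {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.0.0.2:7788;   # bravo's fixed ip
        meta-disk internal;
      }
    }

    resource data-upper {
      protocol A;                  # WAN link; C (or maybe B) if all local

      stacked-on-top-of data-lower {
        device    /dev/drbd1;
        # [cluster ip 2]: the floating ip, moves with the lower Primary
        address   192.168.42.10:7789;
        # no disk/meta-disk here: drbd1 uses drbd0 as its disk, and its
        # meta data lives at the end of drbd0 ("drbd1-metadata" above)
      }
      on charlie {
        device    /dev/drbd1;
        disk      /dev/sdb1;
        address   192.168.42.20:7789;
        meta-disk internal;
      }
    }

you would then create the meta data and bring up the stacked resource
on whichever node is currently Primary of data-lower, using drbdadm's
--stacked option (drbdadm --stacked create-md data-upper; drbdadm
--stacked up data-upper; drbdadm --stacked primary data-upper), and
mount /dev/drbd1, not /dev/drbd0, as /data.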
hope that helps.

> > > Other Questions:
> > > - Is the manual available for download/printing?
> >
> > No. We hand it out in training sessions, though.
>
> Vienna sounds good ... now if I could convince my boss ...

we also do London and Berlin regularly. I'm not sure about North
America; I don't think we have a fixed training schedule there yet.
But if you are interested, and we get a few "me too" from North
America, we shall be able to arrange one.

> > > - Has anyone used the nx_lsa (Linux Sockets Acceleration) driver
> > > to run drbd?
> >
> > I'm not exactly sure what that is supposed to do.
>
> See http://www.netxen.com/technology/pdfs/Netxen_LinuxSocketsAcc_r3.pdf
> Essentially it implements a socket-level offload of the network
> subsystem to a TCP stack running in firmware on the NIC. By using the
> nxoffload facility you can specify tcp ip, ports or applications to
> offload.

"Linux simply cannot drive a 10-Gigabit Ethernet pipe using standard
1500 byte packets"

so what. use jumbo frames. use interrupt coalescing.

"Another approach is to use a TCP offload engine (TOE). The Linux
community has opposed this approach for several reasons, including
rigid implementations and limited functionality"

uh? then why do several (all?) 10-Gigabit network drivers in the linux
kernel have TOE functionality, which can be switched on and off using
ethtool -K? blablah...

"And because LSA is implemented completely in firmware, it can
accommodate TCP stack or Linux kernel changes, as required."

well, if that is true, then we should not even notice, right? it
should "just work".

"In operation, LSA intercepts calls at the INET ops layer. Based on
the rules defined by Selective Acceleration, the offload decision is
taken on connect() for active connections and listen() and accept()
for passive connections."

now, this sounds more like an LD_PRELOAD hack? in which case it would
be of no use for DRBD, because DRBD's connections are established from
kernel context. no LD_PRELOAD there.

I guess you just have to try, and report back. if it does not "just
work", ask them whether they support connections established from
kernel context, and if/how code would have to be modified to leverage
this LSA stuff. and then send a patch. or get us to do the work for
you ;)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

__
please don't Cc me, but send to list -- I'm subscribed