Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 10/12/2011 02:34 PM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
> Digimer,
>
> Thanks again for holding my hand on this. I've already started reading your wiki posts. I wish Google gave your site a better ranking. I've been doing research for months, and your articles (especially the comments in the config files) are very helpful.

Happy it helps! Linking back to it might help. ;)

>> Also note that I had either leg of the bond routed through different switches. I had tried stacking them (hence the ability to LAG) but ran into issues there as well. So now for HA networking I use two independent switches, with a simple uplink between the switches, and mode=1. This configuration has proven very reliable for me.
>
> I am using a single M4900 switch due to project budget issues right now. Once we go further toward production I intend to use two stacked M4900 switches. For now LACP hasn't been a problem. I will test with stacked M4900s and get back to you with my results.

Consider the possibility that you might one day want/need Red Hat support. In such a case, not using mode=1 will be a barrier. Obviously your build is to your spec, but do please carefully consider mode=1 before going into production (there's a sample bond config further down).

>> Fencing is handled entirely within the cluster (cluster.conf). I use Lon's "obliterate-peer.sh" script as the DRBD fence-handler. When DRBD sees a split-brain, it blocks (with 'resource-and-stonith') and calls 'fence_node <victim>', then waits for a successful return. The result is that, on fault, the node gets fenced twice (once from the DRBD call, once from the cluster itself), but it works just fine.
>
> Great explanation. Thanks!
>
>> If you are using IPMI (or another out-of-band BMC), be sure to also set up a switched PDU as a backup fence device (like an APC AP7900). Without this backup fencing method, your cluster will hang if a node loses power entirely, because the survivor will not be able to talk to the IPMI interface to set/confirm the node's state.
>
> We are in an enterprise datacenter with two PDUs per rack, UPS, and generators. Also, the servers have two power supplies. So, I don't envision a power failure. The PDUs are owned and controlled by the infrastructure team, so IPMI is my only choice.

I've seen faults in the mainboard, in the cable going from the PSU to the mainboard, and other such faults take out a server. Simply assuming the power will never fail is unwise. Deciding you can live with the risk of a hung cluster, however, is a choice you can make.

As for infrastructure restrictions: I deal with this by bringing in two of my own PDUs and running each off of a different mains source (be it another PDU, a UPS, or mains directly). Then I can configure and use the PDUs however I wish.
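For reference, the two-switch mode=1 bond I described looks roughly like this on RHEL/CentOS. The interface names, IP address and exact BONDING_OPTS below are only examples; adapt them to your hardware:

  # /etc/modprobe.d/bonding.conf
  alias bond0 bonding

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  ONBOOT=yes
  BOOTPROTO=none
  IPADDR=10.20.0.1
  NETMASK=255.255.255.0
  # mode=1 is active/backup; miimon polls link state every 100 ms
  BONDING_OPTS="mode=1 miimon=100 use_carrier=1"

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (eth0 -> switch 1; repeat for
  # eth1 -> switch 2, changing only DEVICE)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none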
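The DRBD side of the fencing hand-off is just two directives. Something like this (DRBD 8.3 syntax; the resource name and the path to Lon's script are examples, point it at wherever you actually installed the script):

  # /etc/drbd.conf (or the per-resource file)
  resource r0 {
          disk {
                  # block I/O on split-brain/connection loss until the peer is fenced
                  fencing resource-and-stonith;
          }
          handlers {
                  # calls 'fence_node <victim>' through the cluster and only
                  # returns success once the peer has actually been fenced
                  fence-peer "/sbin/obliterate-peer.sh";
          }
  }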
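And the matching cluster.conf fencing, with IPMI as the first method and the switched PDU as the fallback, would look something like this. Node names, addresses, logins, passwords and the PDU outlet number are placeholders:

  <clusternode name="node1" nodeid="1">
          <fence>
                  <method name="ipmi">
                          <device name="ipmi_node1" action="reboot"/>
                  </method>
                  <method name="pdu">
                          <device name="pdu1" port="1" action="reboot"/>
                  </method>
          </fence>
  </clusternode>
  <!-- the second node gets the same two methods, pointing at its own devices -->

  <fencedevices>
          <fencedevice name="ipmi_node1" agent="fence_ipmilan" ipaddr="10.20.1.1" login="admin" passwd="secret"/>
          <fencedevice name="pdu1" agent="fence_apc" ipaddr="10.20.2.1" login="apc" passwd="secret"/>
  </fencedevices>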
>>> I've done the following DD tests:
>>> 1. Non-replicated DRBD volume with no FS
>> You mean StandAlone/Primary?
> Yes.
>>> 2. Replicated DRBD volume with no FS
>> So Primary/Primary?
> Yes.
>>> 3. Replicated DRBD volume with GFS2 mounted locally
>> How else would you mount it?
> See below.
>>> 4. Replicated DRBD volume with GFS2 mounted over GNBD
>> No input here, sorry.
> See below.
>>> 5. Replicated DRBD volume with GFS2 mounted over iSCSI (IET)
>> Where does iSCSI fit into this? Are you making the DRBD resource a tgtd target and connecting to it locally or something?
>
> In #1 and #2, I used "dd if=/dev/zero of=/dev/drbd0 oflag=direct bs=512K count=1000000". Results were great (almost the same as writing directly to /dev/sdb, which is the backing store for DRBD).
>
> In #3, I used "mount -t gfs2 /dev/drbd0 /mnt" and then "dd if=/dev/zero of=/mnt/512K-testfile oflag=direct bs=512K count=1000000". Results were almost equally great (trivial performance loss).
>
> In #4 and #5, I used my two DRBD boxes as storage servers and exported the DRBD volume via GNBD and iSCSI, respectively. I then connected a 3rd node (via the same 10GbE equipment) and imported the volumes onto said 3rd node (again via GNBD and iSCSI, respectively). I set up round-robin multipath, and then mounted them using "mount -t gfs2 /dev/mpath/mpath1 /mnt". Then I ran "dd if=/dev/zero of=/mnt/512K-testfile oflag=direct bs=512K count=1000000". Results were horrible (not even 50% compared to #1-3).
>
> So my setup looks like this:
>
> DRBD (pri/pri) -> gfs2 -> gnbd -> multipath -> mount.gfs2
>
> I skipped clvmd because I do not need any of the features of LVM. My RAID volume is 4.8TB. We will replace the equipment in 3 years, and by the most aggressive estimates we will use 2.4TB at most within those 3 years.
>
> Thanks,
> Mike

Simplify the remote mount test: export the raw DRBD device over iSCSI across a simple, non-redundant 10Gbit link. Mount the raw space as a simple ext3 partition and test again. If that tests well, start putting pieces back one at a time. If it still tests badly, look at your network config.

As an aside, I've not used multipath because of warnings I got from others. This leaves me in a position where I can't rightly say why you shouldn't use it, but I'd try it without it.

Simple, simple, simple. Get it working, then start layering it up. :)

--
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?"