Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 10/12/2011 02:34 PM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
> Digimer,
>
> Thanks again for holding my hand on this. I've already started reading your wiki posts. I wish Google gave your site a better ranking. I've been doing research for months, and your articles (especially the comments in the config files) are very helpful.

Happy it helps! Linking back to it might help. ;)

>> Also note that I had either leg of the bond routed through different switches. I had tried stacking them (hence the ability to LAG) but ran into issues there as well. So now for HA networking I use two independent switches, with a simple uplink between the switches, and mode=1. This configuration has proven very reliable for me.
>
> I am using a single M4900 switch due to project budget issues right now. Once we go further toward production I intend to use two stacked M4900 switches. For now LACP hasn't been a problem. I will test with stacked M4900s and get back to you with my results.

Consider the possibility that you might one day want/need Red Hat support. In such a case, not using mode=1 will be a barrier. Obviously your build is to your spec, but do please carefully consider mode=1 before going into production (there's a sample bond config further down).

>> Fencing is handled entirely within the cluster (cluster.conf). I use Lon's "obliterate-peer.sh" script as the DRBD fence-handler. When DRBD sees a split-brain, it blocks (with 'resource-and-stonith') and calls 'fence_node <victim>', then waits for a successful return. The result is that, on fault, the node gets fenced twice (once from the DRBD call, once from the cluster itself), but it works just fine.
>
> Great explanation. Thanks!
>
>> If you are using IPMI (or another out-of-band BMC), be sure to also set up a switched PDU as a backup fence device (like an APC AP7900). Without this backup fencing method, your cluster will hang if a node loses power entirely, because the survivor will not be able to talk to the IPMI interface to set/confirm the node's state.
>
> We are in an enterprise datacenter with two PDUs per rack, UPS, and generators. Also, the servers have two power supplies. So, I don't envision a power failure. The PDUs are owned and controlled by the infrastructure team, so IPMI is my only choice.

I've seen faults in the mainboard, in the cable going from the PSU to the mainboard, and other such faults take out a server. Simply assuming the power will never fail is unwise. Deciding you can live with the risk of a hung cluster, however, is a choice you can make.

As for infrastructure restrictions: I deal with this by bringing in two of my own PDUs and running each off of a different mains source (be it another PDU, a UPS, or mains directly). Then I can configure and use the PDUs however I wish.
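For reference, the two-switch mode=1 bond I described looks roughly like this on RHEL/CentOS. The interface names, IP address and exact BONDING_OPTS below are only examples; adapt them to your hardware:

  # /etc/modprobe.d/bonding.conf
  alias bond0 bonding

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  ONBOOT=yes
  BOOTPROTO=none
  IPADDR=10.20.0.1
  NETMASK=255.255.255.0
  # mode=1 is active/backup; miimon polls link state every 100 ms
  BONDING_OPTS="mode=1 miimon=100 use_carrier=1"

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (eth0 -> switch 1; repeat for
  # eth1 -> switch 2, changing only DEVICE)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none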
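The DRBD side of the fencing hand-off is just two directives. Something like this (DRBD 8.3 syntax; the resource name and the path to Lon's script are examples, point it at wherever you actually installed the script):

  # /etc/drbd.conf (or the per-resource file)
  resource r0 {
          disk {
                  # block I/O on split-brain/connection loss until the peer is fenced
                  fencing resource-and-stonith;
          }
          handlers {
                  # calls 'fence_node <victim>' through the cluster and only
                  # returns success once the peer has actually been fenced
                  fence-peer "/sbin/obliterate-peer.sh";
          }
  }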
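And the matching cluster.conf fencing, with IPMI as the first method and the switched PDU as the fallback, would look something like this. Node names, addresses, logins, passwords and the PDU outlet number are placeholders:

  <clusternode name="node1" nodeid="1">
          <fence>
                  <method name="ipmi">
                          <device name="ipmi_node1" action="reboot"/>
                  </method>
                  <method name="pdu">
                          <device name="pdu1" port="1" action="reboot"/>
                  </method>
          </fence>
  </clusternode>
  <!-- the second node gets the same two methods, pointing at its own devices -->

  <fencedevices>
          <fencedevice name="ipmi_node1" agent="fence_ipmilan" ipaddr="10.20.1.1" login="admin" passwd="secret"/>
          <fencedevice name="pdu1" agent="fence_apc" ipaddr="10.20.2.1" login="apc" passwd="secret"/>
  </fencedevices>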
>>> I've done the following DD tests:
>>> 1. Non-replicated DRBD volume with no FS
>> You mean StandAlone/Primary?
> Yes.
>>> 2. Replicated DRBD volume with no FS
>> So Primary/Primary?
> Yes.
>>> 3. Replicated DRBD volume with GFS2 mounted locally
>> How else would you mount it?
> See below.
>>> 4. Replicated DRBD volume with GFS2 mounted over GNBD
>> No input here, sorry.
> See below.
>>> 5. Replicated DRBD volume with GFS2 mounted over iSCSI (IET)
>> Where does iSCSI fit into this? Are you making the DRBD resource a tgtd target and connecting to it locally or something?
>
> In #1 and #2, I used "dd if=/dev/zero of=/dev/drbd0 oflag=direct bs=512K count=1000000". Results were great (almost the same as writing directly to /dev/sdb, which is the backing store for DRBD).
>
> In #3, I used "mount -t gfs2 /dev/drbd0 /mnt" and then "dd if=/dev/zero of=/mnt/512K-testfile oflag=direct bs=512K count=1000000". Results were almost equally great (trivial performance loss).
>
> In #4 and #5, I used my two DRBD boxes as storage servers and exported the DRBD volume via GNBD and iSCSI, respectively. I then connected a 3rd node (via the same 10GbE equipment) and imported the volumes onto said 3rd node (again via GNBD and iSCSI, respectively). I set up round-robin multipath, and then mounted them using "mount -t gfs2 /dev/mpath/mpath1 /mnt". Then I ran "dd if=/dev/zero of=/mnt/512K-testfile oflag=direct bs=512K count=1000000". Results were horrible (not even 50% compared to #1-3).
>
> So my setup looks like this:
>
> DRBD (pri/pri) -> gfs2 -> gnbd -> multipath -> mount.gfs2
>
> I skipped clvmd because I do not need any of the features of LVM. My RAID volume is 4.8TB. We will replace the equipment in 3 years, and by the most aggressive estimates we will use 2.4TB at most within those 3 years.
>
> Thanks,
> Mike

Simplify the remote mount test: export the raw DRBD device over iSCSI across a simple, non-redundant 10Gbit link. Mount the raw space as a simple ext3 partition and test again. If that tests well, start putting pieces back one at a time. If it still tests badly, look at your network config.

As an aside, I've not used multipath because of warnings I got from others. This leaves me in a position where I can't rightly say why you shouldn't use it, but I'd try it without it.

Simple, simple, simple. Get it working, then start layering it up. :)

--
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?"