On 10/12/2011 11:41 AM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
> Hi,
>
> Thanks for the quick reply.
>
>> Red Hat cluster suite? If so, LACP isn't supported (only
>> Active/Passive is for redundancy). This is aside from your question
>> though, of course.
>
> I am using RHEL Cluster Suite with minimal configs (node list and
> fencing only). I have my NICs in bonding mode 4. I am using IPMI
> fencing (on a separate 1GbE NIC). I am using LACP for redundancy, not
> for any performance boost. Can you please explain how/why bonding
> mode makes a difference for RHEL CS?

You'd need to talk to your RH sales contact for details. If I were to guess, though, it's probably that mode 1 (active/passive) outside a LAG is the most reliable failure/recovery mode. I know that when I was doing my own testing, I ran into recovery issues when using mode=4/LAG. Note also that I had each leg of the bond routed through a different switch. I had tried stacking the switches (hence the ability to LAG) but ran into issues there as well. So now, for HA networking, I use two independent switches with a simple uplink between them, and mode=1. This configuration has proven very reliable for me.

>> I'd suggest putting a delay in the second node's fence call. That
>> way, in a true split-brain, the primary will have a good head start
>> in calling the fence against the backup node. However, time to
>> recovery when the primary really does fail will grow by the delay
>> amount.
>
> This is my first time using primary/primary, GFS2, and RHEL CS. Can
> you please explain in more detail how and where to do this? Are you
> talking about DRBD's fencing system, the RHEL CS fencing system,
> etc.? Can DRBD handle this sort of fencing in the case of SB instead
> of relying on RHEL CS? Also, my nodes are round-robin multipathing.
> Won't adding a fence delay lead to data corruption?

This is a question with a very, very long answer.
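Before the long answer, the fence-delay part on its own is small. It lives in cluster.conf: put a delay attribute on the fence device entry of the node you want to *survive* a split-brain, so that fencing it is held back and it wins the race. A rough sketch only; the hostnames, device names, and the 15-second value here are made-up placeholders, not anyone's real config:

```
<!-- inside <clusternodes> in /etc/cluster/cluster.conf (sketch) -->
<clusternode name="node1.example.com" nodeid="1">
	<fence>
		<method name="ipmi">
			<!-- fencing node1 is delayed 15s, so node1 (primary) wins
			     the race in a true split-brain -->
			<device name="ipmi_node1" action="reboot" delay="15"/>
		</method>
	</fence>
</clusternode>
<clusternode name="node2.example.com" nodeid="2">
	<fence>
		<method name="ipmi">
			<!-- no delay; node2 (backup) is fenced immediately -->
			<device name="ipmi_node2" action="reboot"/>
		</method>
	</fence>
</clusternode>
```

The trade-off is as described above: when node1 really does die, recovery is slower by the delay amount.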
So long, in fact, that I wrote a tutorial covering this configuration:

https://alteeve.com/w/Red_Hat_Cluster_Service_2_Tutorial

That can more or less be applied directly to RHEL 6 / cman v3. There are some subtle differences, but nothing insurmountable. (I am working on a new version, but it won't be complete for a little while yet.)

To answer your question briefly:

Storage: raw disk -> drbd -> clvmd -> {gfs2, VM storage}

Fencing is handled entirely within the cluster (cluster.conf). I use Lon's "obliterate-peer.sh" script as the DRBD fence-handler. When DRBD sees a split-brain, it blocks (with 'resource-and-stonith') and calls 'fence_node <victim>', then waits for a successful return. The result is that, on a fault, the node gets fenced twice (once from the DRBD call, once from the cluster itself), but it works just fine.

As an aside: if you are using IPMI (or another out-of-band BMC), be sure to also set up a switched PDU as a backup fence device (like an APC AP7900). Without this backup fencing method, your cluster will hang if a node loses power entirely, because the survivor will not be able to talk to the IPMI interface to set/confirm the node's state.

>> There is overhead because of the distributed nature of clustered
>> storage. However, I can't say where/why your latency is coming from,
>> so I don't have much to recommend at this time.
>>
>> If you create a simple DRBD resource and test, what is the overhead
>> relative to the bare drives underneath? How does that change when
>> you add simple GFS2? How about if you used CLVMd as a (test)
>> alternative? If the latency is fairly close between GFS2 and clvmd,
>> it's possibly DLM overhead.
>
> I've done the following DD tests:
>
> 1. Non-replicated DRBD volume with no FS

You mean StandAlone/Primary?

> 2. Replicated DRBD volume with no FS

So Primary/Primary?

> 3. Replicated DRBD volume with GFS2 mounted locally

How else would you mount it?

> 4. Replicated DRBD volume with GFS2 mounted over GNBD

No input here, sorry.

> 5. Replicated DRBD volume with GFS2 mounted over iSCSI (IET)

Where does iSCSI fit into this? Are you making the DRBD resource a tgtd target and connecting to it locally, or something?

> Results of #4 and #5 are dismal compared to #1, 2, and 3. I would
> think that DLM would apply even to locally mounted GFS2, as I
> specified lock_dlm when creating the FS.
>
> Thanks,
> Michael

I've done this: a two-node cluster with DRBD in a simple 2-node RHCS v3 (on EL6) configuration, exporting the raw DRBD resource via tgtd. A second cluster of nodes connects to the SAN cluster using iSCSI and then uses that iSCSI device as a PV in a clustered LVM VG. The LVs created on this then back a GFS2 partition (plus LVs backing VMs). I didn't benchmark it (it was a proof of concept), but I didn't notice painful disk performance, all things considered. Perhaps you can try something similar?

It would help if you could describe what problem you are trying to solve with your cluster. At this point, it's a lot of conjecture and guessing. With details of what you need, we might be able to offer more specific advice and ask more intelligent questions.

Cheers,

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?"
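For concreteness, the DRBD side of the fencing described above comes down to two directives in the resource file: 'fencing resource-and-stonith' to block I/O until the handler succeeds, and a 'fence-peer' handler pointing at Lon's script. A minimal sketch only; the resource name, backing disk, node names, addresses, and the exact script path are placeholder assumptions, not a tested config:

```
# /etc/drbd.d/r0.res (sketch, DRBD 8.3-style syntax)
resource r0 {
	disk {
		# On connection loss, suspend I/O and call the fence-peer
		# handler; resume only after it reports success.
		fencing		resource-and-stonith;
	}
	handlers {
		# Lon's script: calls 'fence_node <victim>' through the
		# cluster and returns success once the peer is fenced.
		fence-peer	"/usr/lib/drbd/obliterate-peer.sh";
	}
	on node1.example.com {
		device		/dev/drbd0;
		disk		/dev/sda5;
		address		192.168.1.1:7788;
		meta-disk	internal;
	}
	on node2.example.com {
		device		/dev/drbd0;
		disk		/dev/sda5;
		address		192.168.1.2:7788;
		meta-disk	internal;
	}
}
```

This is what produces the "fenced twice" behaviour mentioned above: DRBD's handler fences the peer, and the cluster's own fencing fires independently for the same fault.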