On 10/12/2011 11:41 AM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
> Hi,
>
> Thanks for the quick reply.
>
>> Red Hat cluster suite? If so, LACP isn't supported (only
>> Active/Passive is for redundancy). This is aside from your question
>> though, of course.
>
> I am using RHEL Cluster Suite with minimal configs (node list and
> fencing only). I have my NICs in bonding mode 4. I am using IPMI
> fencing (on a separate 1GbE NIC). I am using LACP for redundancy, not
> for any performance boost. Can you please explain how/why bonding
> mode makes a difference for RHEL CS?

You'd need to talk to your RH sales contact for details. If I were to guess, though, it's probably that mode 1 (active/passive) outside a LAG is the most reliable failure/recovery mode. I know that when I was doing my own testing, I ran into recovery issues when using mode=4/LAG. Note also that I had each leg of the bond routed through a different switch. I had tried stacking the switches (hence the ability to LAG) but ran into issues there as well. So now, for HA networking, I use two independent switches with a simple uplink between them, and mode=1. This configuration has proven very reliable for me.

>> I'd suggest putting a delay in the second node's fence call. That
>> way, in a true split-brain, the primary will have a good head start
>> in calling the fence against the backup node. However, time to
>> recovery when the primary really does fail will grow by the delay
>> amount.
>
> This is my first time using primary/primary, GFS2, and RHEL CS. Can
> you please explain in more detail how and where to do this? Are you
> talking about DRBD's fencing system, the RHEL CS fencing system,
> etc.? Can DRBD handle this sort of fencing in the case of SB instead
> of relying on RHEL CS? Also, my nodes are round-robin multipathing.
> Won't adding a fence delay lead to data corruption?

This is a question with a very, very long answer.
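Before the long answer, the fence-delay part on its own is small. It lives in cluster.conf: put a delay attribute on the fence device entry of the node you want to *survive* a split-brain, so that fencing it is held back and it wins the race. A rough sketch only; the hostnames, device names, and the 15-second value here are made-up placeholders, not anyone's real config:

```
<!-- inside <clusternodes> in /etc/cluster/cluster.conf (sketch) -->
<clusternode name="node1.example.com" nodeid="1">
	<fence>
		<method name="ipmi">
			<!-- fencing node1 is delayed 15s, so node1 (primary) wins
			     the race in a true split-brain -->
			<device name="ipmi_node1" action="reboot" delay="15"/>
		</method>
	</fence>
</clusternode>
<clusternode name="node2.example.com" nodeid="2">
	<fence>
		<method name="ipmi">
			<!-- no delay; node2 (backup) is fenced immediately -->
			<device name="ipmi_node2" action="reboot"/>
		</method>
	</fence>
</clusternode>
```

The trade-off is as described above: when node1 really does die, recovery is slower by the delay amount.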
So long, in fact, that I wrote a tutorial covering this configuration:

https://alteeve.com/w/Red_Hat_Cluster_Service_2_Tutorial

That can more or less be applied directly to RHEL 6 / cman v3. There are some subtle differences, but nothing insurmountable. (I am working on a new version, but it won't be complete for a little while yet.)

To answer your question briefly:

Storage: raw disk -> drbd -> clvmd -> {gfs2, VM storage}

Fencing is handled entirely within the cluster (cluster.conf). I use Lon's "obliterate-peer.sh" script as the DRBD fence-handler. When DRBD sees a split-brain, it blocks (with 'resource-and-stonith') and calls 'fence_node <victim>', then waits for a successful return. The result is that, on a fault, the node gets fenced twice (once from the DRBD call, once from the cluster itself), but it works just fine.

As an aside: if you are using IPMI (or another out-of-band BMC), be sure to also set up a switched PDU as a backup fence device (like an APC AP7900). Without this backup fencing method, your cluster will hang if a node loses power entirely, because the survivor will not be able to talk to the IPMI interface to set/confirm the node's state.

>> There is overhead because of the distributed nature of clustered
>> storage. However, I can't say where/why your latency is coming from,
>> so I don't have much to recommend at this time.
>>
>> If you create a simple DRBD resource and test, what is the overhead
>> relative to the bare drives underneath? How does that change when
>> you add simple GFS2? How about if you used CLVMd as a (test)
>> alternative? If the latency is fairly close between GFS2 and clvmd,
>> it's possibly DLM overhead.
>
> I've done the following DD tests:
>
> 1. Non-replicated DRBD volume with no FS

You mean StandAlone/Primary?

> 2. Replicated DRBD volume with no FS

So Primary/Primary?

> 3. Replicated DRBD volume with GFS2 mounted locally

How else would you mount it?

> 4. Replicated DRBD volume with GFS2 mounted over GNBD

No input here, sorry.

> 5. Replicated DRBD volume with GFS2 mounted over iSCSI (IET)

Where does iSCSI fit into this? Are you making the DRBD resource a tgtd target and connecting to it locally, or something?

> Results of #4 and #5 are dismal compared to #1, 2, and 3. I would
> think that DLM would apply even to locally mounted GFS2, as I
> specified lock_dlm when creating the FS.
>
> Thanks,
> Michael

I've done this: a two-node cluster with DRBD in a simple 2-node RHCS v3 (on EL6) configuration, exporting the raw DRBD resource via tgtd. A second cluster of nodes connects to the SAN cluster using iSCSI and then uses that iSCSI device as a PV in a clustered LVM VG. The LVs created on this then back a GFS2 partition (plus LVs backing VMs). I didn't benchmark it (it was a proof of concept), but I didn't notice painful disk performance, all things considered. Perhaps you can try something similar?

It would help if you could describe what problem you are trying to solve with your cluster. At this point, it's a lot of conjecture and guessing. With details of what you need, we might be able to offer more specific advice and ask more intelligent questions.

Cheers,

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?"
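For concreteness, the DRBD side of the fencing described above comes down to two directives in the resource file: 'fencing resource-and-stonith' to block I/O until the handler succeeds, and a 'fence-peer' handler pointing at Lon's script. A minimal sketch only; the resource name, backing disk, node names, addresses, and the exact script path are placeholder assumptions, not a tested config:

```
# /etc/drbd.d/r0.res (sketch, DRBD 8.3-style syntax)
resource r0 {
	disk {
		# On connection loss, suspend I/O and call the fence-peer
		# handler; resume only after it reports success.
		fencing		resource-and-stonith;
	}
	handlers {
		# Lon's script: calls 'fence_node <victim>' through the
		# cluster and returns success once the peer is fenced.
		fence-peer	"/usr/lib/drbd/obliterate-peer.sh";
	}
	on node1.example.com {
		device		/dev/drbd0;
		disk		/dev/sda5;
		address		192.168.1.1:7788;
		meta-disk	internal;
	}
	on node2.example.com {
		device		/dev/drbd0;
		disk		/dev/sda5;
		address		192.168.1.2:7788;
		meta-disk	internal;
	}
}
```

This is what produces the "fenced twice" behaviour mentioned above: DRBD's handler fences the peer, and the cluster's own fencing fires independently for the same fault.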