[DRBD-user] A few configuration questions specific to RHEL 5 primary/primary GFS2 setup

Kaloyan Kovachev kkovachev at varna.net
Thu Oct 13 13:14:03 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,
 I'm a bit late replying because of the time zone, but ...

On Wed, 12 Oct 2011 14:45:54 -0400, Digimer <linux at alteeve.com> wrote:
> On 10/12/2011 02:34 PM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
>> Digimer,
>>
>> Thanks again for holding my hand on this. I've already started reading
>> your wiki posts. I wish Google gave your site a better ranking. I've
>> been doing research for months, and your articles (especially comments
>> in the config files) are very helpful.
> 
> Happy it helps! Linking back to it might help. ;)
> 
>>> Also note that I had either leg of the bond routed through different
>>> switches. I had tried stacking them (hence the ability to LAG) but ran
>>> into issues there as well. So now for HA networking I use two
>>> independent switches, with a simple uplink between the switches, and
>>> mode=1. This configuration has proven very reliable for me.
>>
>> I am using a single M4900 switch due to project budget issues right
>> now. Once we go further toward production I intend to use two stacked
>> M4900 switches. For now LACP hasn't been a problem. I will test with
>> stacked M4900s and get back to you with my results.
> 
> Consider the possibility that you might one day want/need Red Hat 
> support. In such a case, not using mode=1 will be a barrier. Obviously 
> your build is to your spec, but do please carefully consider mode=1 
> before going into production.
> 

I am using LACP (mode=4) with stacked switches without problems, but
Digimer is right about the support barrier.
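
For reference, mode=1 on RHEL 5 takes only a few lines; a minimal
sketch, assuming bond0 with eth0/eth1 as slaves (device names and
timer values are placeholders):

    # /etc/modprobe.conf
    alias bond0 bonding
    options bond0 mode=1 miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 alike)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

Switching modes later only needs the 'options' line changed and the
bond reloaded.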

>>> Fencing is handled entirely within the cluster (cluster.conf). I use
>>> Lon's "obliterate-peer.sh" script as the DRBD fence-handler. When DRBD
>>> sees a split-brain, it blocks (with 'resource-and-stonith') and calls
>>> 'fence_node <victim>' and waits for a successful return. The result is
>>> that, on fault, the node gets fenced twice (once from the DRBD call,
>>> once from the cluster itself) but it works just fine.
>>
>> Great explanation. Thanks!
>>

'resource-and-stonith' is the key here - multipath will retry the failed
requests on the surviving node _after_ IO resumes.
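
The drbd.conf side of what Digimer describes looks roughly like this;
a sketch only, assuming a resource named r0 and that Lon's script lives
in /usr/local/bin (adjust the path to wherever you installed it):

    resource r0 {
      disk {
        # block IO on split-brain until the peer is confirmed fenced
        fencing resource-and-stonith;
      }
      handlers {
        # calls fence_node <victim> and unblocks only on success
        fence-peer "/usr/local/bin/obliterate-peer.sh";
      }
    }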

>>
>>>> 4. Replicated DRBD volume with GFS2 mounted over GNBD
>>
>>> No input here, sorry.
>>
>> See below.
>>
>>>> 5. Replicated DRBD volume with GFS2 mounted over iSCSI (IET)
>>
>>
>> So my setup looks like this:
>>
>> DRBD (pri/pri)->gfs2->gnbd->multipath->mount.gfs2
>>

While setting up the cluster I also tried GNBD, but switched to iSCSI
(IET) because it allows importing the device locally too, which is not
possible with GNBD. With such a setup it is possible to use the same
(multipath) name for the device instead of drbdX on the local machine,
to avoid deadlocks. The resulting setup is:

LVM->DRBD (pri/pri)->iSCSI->multipath->gfs2
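
To get the same device name everywhere, I pin an alias in
multipath.conf; a sketch, with the WWID as a placeholder (take the real
one from 'multipath -ll'):

    # /etc/multipath.conf
    multipaths {
      multipath {
        wwid  <wwid-of-the-iscsi-lun>   # placeholder
        alias gfs2vol
      }
    }

Every node, including the one exporting the LUN, then mounts
/dev/mapper/gfs2vol instead of touching drbdX directly.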

>> I skipped clvmd because I do not need any of the features of LVM. My
>> RAID volume is 4.8TB. We will replace equipment in 3 years, and in most
>> aggressive estimates we will use 2.4TB at most within 3 years.
>>

The use of LVM (not CLVM in my case) comes in handy for backups: you can
snapshot a volume and mount it with local locking, which is much faster
because no DLM overhead, iSCSI or DRBD is involved.
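
The backup flow is essentially this; a sketch with made-up volume
names, and lock_nolock is only safe here because nothing else ever
mounts the snapshot:

    # snapshot the LV underneath DRBD, then mount it read-only with
    # local locking - no DLM, iSCSI or DRBD in the IO path
    lvcreate -s -L 20G -n gfs2_snap /dev/vg0/gfs2_lv
    mount -t gfs2 -o ro,lockproto=lock_nolock /dev/vg0/gfs2_snap /mnt/backup
    # run the backup, then clean up
    umount /mnt/backup
    lvremove -f /dev/vg0/gfs2_snap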

Now back to your original questions:

> 1. In the case of a 2-primary split brain (switch hiccup, etc), I would
> like server #1 to always remain primary and server #2 to always shut
> down. I would like this behavior because server #2 can't become
> secondary because GNBD is not going to release it. What is the best way
> to accomplish this?

Use 'resource-and-stonith', then modify your 'fence-peer' handler to
sleep on the second server. As a handler you may use obliterate-peer.sh
or the one I posted to this list a week ago.
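
A minimal sketch of such a wrapper, with the hostname and delay as
placeholders; the point is only that server #2 gives server #1 a head
start in the fencing race:

    #!/bin/bash
    # fence-peer wrapper: delay on server #2 so that in a true
    # split-brain server #1 always shoots first
    if [ "$(uname -n)" = "server2.example.com" ]; then
        sleep 10
    fi
    exec /usr/local/bin/obliterate-peer.sh "$@"

Point 'fence-peer' in drbd.conf at this wrapper instead of at the
handler itself.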

> 2. I've tried the deadline queue manager as well as CFQ. I've noticed
> no difference. Can you please elaborate on why deadline is better, and
> how can I measure any performance difference between the two?

Just something I have observed: if you start writing a file of a few
gigabytes, with CFQ the IO for the entire GFS stops after some time for
a few seconds and even small requests from other nodes are blocked; this
does not happen with deadline.
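
You can compare the two without rebooting; sdX is a placeholder for the
DRBD backing disk:

    # show the active scheduler (the one in brackets)
    cat /sys/block/sdX/queue/scheduler
    # switch to deadline at runtime
    echo deadline > /sys/block/sdX/queue/scheduler

To make it the default, add elevator=deadline to the kernel line in
grub.conf. Then rerun your big-file write from one node while issuing
small IOs from another and watch for stalls.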

> 3. It seems that GNBD is the biggest source of latency in my system. It
> decreases IOPS by over ~50% (based on DD tests compared to the same
> DRBD based GFS2 mounted locally). I've also tried Enterprise iSCSI
> target as an alternative and the results were not much better. The
> latency on my LAN is ~0.22ms. Can you offer any tuning tips?

Yes, even when iSCSI (in my case) is connected via the loopback
interface there is a performance impact. You can fine-tune your iSCSI
client (open-iscsi in my case) and multipath for your usage pattern
(check the queue depth / data segment size for iSCSI and rr_min_io for
multipath), and you should also use jumbo frames if possible, but it
will still be slower than directly attached disks.
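
The knobs I mean are roughly these; the values are starting points to
benchmark against your own workload, not recommendations:

    # /etc/iscsi/iscsid.conf (open-iscsi)
    node.session.cmds_max = 128
    node.session.queue_depth = 64
    node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144

    # /etc/multipath.conf - requests sent down a path before switching
    defaults {
        rr_min_io 100
    }

For jumbo frames set MTU=9000 in the ifcfg file of the storage
interface (the switch must support it end to end).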

A test case that isolates the network latency: run DRBD primary and
connected, but in diskless mode, so all reads and writes go to the
remote node. You will probably see nearly the same performance as when
using GNBD/iSCSI (your DD tests 4 and 5).
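
Assuming a resource named r0, the test is simply:

    # drop the local backing disk: the node stays primary, but every
    # read and write now travels to the peer
    drbdadm detach r0
    # ... rerun the DD tests, then restore the disk
    drbdadm attach r0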





