Hi, a bit late posting because of the time zone, but ...

On Wed, 12 Oct 2011 14:45:54 -0400, Digimer <linux at alteeve.com> wrote:
> On 10/12/2011 02:34 PM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
>> Digimer,
>>
>> Thanks again for holding my hand on this. I've already started reading
>> your wiki posts. I wish Google gave your site a better ranking. I've
>> been doing research for months, and your articles (especially the
>> comments in the config files) are very helpful.
>
> Happy it helps! Linking back to it might help. ;)
>
>>> Also note that I had either leg of the bond routed through different
>>> switches. I had tried stacking them (hence the ability to LAG) but ran
>>> into issues there as well. So now for HA networking I use two
>>> independent switches, with a simple uplink between the switches, and
>>> mode=1. This configuration has tested very reliable for me.
>>
>> I am using a single M4900 switch due to project budget issues right
>> now. Once we go further toward production I intend to use two stacked
>> M4900 switches. For now LACP hasn't been a problem. I will test with
>> stacked M4900s and get back to you with my results.
>
> Consider the possibility that you might one day want/need Red Hat
> support. In such a case, not using mode=1 will be a barrier. Obviously
> your build is to your spec, but do please carefully consider mode=1
> before going into production.

I am using LACP (mode=4) with stacked switches without problems, but
Digimer is right about the support barrier (a bonding sketch for both
modes is below).

>>> Fencing is handled entirely within the cluster (cluster.conf). I use
>>> Lon's "obliterate-peer.sh" script as the DRBD fence-handler. When DRBD
>>> sees a split-brain, it blocks (with 'resource-and-stonith') and calls
>>> 'fence_node <victim>' and waits for a successful return. The result is
>>> that, on fault, the node gets fenced twice (once from the DRBD call,
>>> once from the cluster itself) but it works just fine.
>>
>> Great explanation. Thanks!

'resource-and-stonith' is the key here - multipath will retry the failed
requests on the surviving node _after_ it resumes IO (see the drbd.conf
fragment below).

>>>> 4. Replicated DRBD volume with GFS2 mounted over GNBD
>>>
>>> No input here, sorry.
>>
>> See below.
>>
>>>> 5. Replicated DRBD volume with GFS2 mounted over iSCSI (IET)
>>
>> So my setup looks like this:
>>
>> DRBD (pri/pri)->gfs2->gnbd->multipath->mount.gfs2

While setting up the cluster I also tried GNBD, but switched to iSCSI
(IET) because it allows importing the device locally too, which is not
possible with GNBD. With such a setup it is possible to use the same
(multipath) name for the device on the local machine instead of drbdX,
which avoids deadlocks (sketched below). The resulting stack is:

LVM->DRBD (pri/pri)->iSCSI->multipath->gfs2

>> I skipped clvmd because I do not need any of the features of LVM. My
>> RAID volume is 4.8TB. We will replace equipment in 3 years, and in the
>> most aggressive estimates we will use 2.4TB at most within 3 years.

The use of LVM (not CLVM in my case) comes in handy for backups - you
can snapshot a volume and mount it with local locking = much faster,
with no DLM overhead and no iSCSI/DRBD involved (see the backup sketch
below).
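For reference, a minimal active-backup (mode=1) bond as it is usually
done on RHEL-style systems - interface names and addresses here are
made up, not taken from either setup:

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (example address)
  DEVICE=bond0
  BONDING_OPTS="mode=1 miimon=100"
  BOOTPROTO=none
  IPADDR=10.0.0.1
  NETMASK=255.255.255.0
  ONBOOT=yes

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  BOOTPROTO=none
  ONBOOT=yes

For LACP you would instead use something like
BONDING_OPTS="mode=4 miimon=100 lacp_rate=1" plus a matching LAG on the
(stacked) switch ports, but as noted above that may be a problem for
Red Hat support.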
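To make the fencing wiring concrete, a drbd.conf fragment roughly along
these lines (DRBD 8.3-style syntax; the resource name and the handler
path are examples - Lon's script may live elsewhere on your systems):

  resource r0 {
          net {
                  allow-two-primaries;            # dual primary for GFS2
          }
          disk {
                  fencing resource-and-stonith;   # freeze IO until the peer is fenced
          }
          handlers {
                  # the handler calls 'fence_node <peer>' and must return success
                  fence-peer "/sbin/obliterate-peer.sh";
          }
  }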
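The IET export and the local multipath alias look roughly like this
(the IQN, WWID and alias are invented for the example):

  # /etc/ietd.conf - export the DRBD device on both nodes
  Target iqn.2011-10.local.cluster:gfs
          Lun 0 Path=/dev/drbd0,Type=blockio

  # /etc/multipath.conf - give the imported device a stable name
  multipaths {
          multipath {
                  wwid  149455400000000000000000001000000
                  alias gfs_disk
          }
  }

GFS2 is then mounted from /dev/mapper/gfs_disk on every node, including
the nodes that export it, so nothing touches /dev/drbdX directly.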
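The backup trick, roughly (volume and mount point names are made up):

  # snapshot the LV underneath DRBD and mount it without DLM
  lvcreate --snapshot --size 10G --name backup_snap /dev/vg0/gfs_lv
  mount -t gfs2 -o lockproto=lock_nolock /dev/vg0/backup_snap /mnt/backup
  # ... run the backup ...
  umount /mnt/backup
  lvremove -f /dev/vg0/backup_snap

lock_nolock is used only for this snapshot mount; the live filesystem
keeps using lock_dlm as before.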
Now back to your original questions:

> 1. In the case of a 2-primary split brain (switch hiccup, etc), I would
> like server #1 to always remain primary and server #2 to always shut
> down. I would like this behavior because server #2 can't become
> secondary because GNBD is not going to release it. What is the best way
> to accomplish this?

Use 'resource-and-stonith', then modify your 'fence-peer' handler to
sleep on the second server, so server #1 always wins the fencing race.
As a handler you may use obliterate-peer.sh or the one I posted to this
list a week ago (a sketch of such a wrapper is at the end of this mail).

> 2. I've tried the deadline queue manager as well as CFQ. I've noticed
> no difference. Can you please elaborate on why deadline is better, and
> how can I measure any performance difference between the two?

Just something I have observed: if you start writing a file of a few
gigabytes, with CFQ the IO for the entire GFS stops after some time for
a few seconds and even small requests from other nodes are blocked; this
does not happen with deadline (switching schedulers is shown below).

> 3. It seems that GNBD is the biggest source of latency in my system. It
> decreases IOPS by over ~50% (based on DD tests compared to the same
> DRBD-based GFS2 mounted locally). I've also tried Enterprise iSCSI
> target as an alternative and the results were not much better. The
> latency on my LAN is ~0.22ms. Can you offer any tuning tips?

Yes. Even when iSCSI (in my case) is connected via the loopback
interface there is a performance impact. You may fine-tune your iSCSI
client (open-iscsi in my case) and multipath for your use case (check
the queue depth / data segment size for iSCSI and rr_min_io for
multipath - example values below), and you should also use jumbo frames
if possible, but it will still be slower than direct-attached disks. A
test case that isolates the network latency: if DRBD is primary and
connected, but in diskless mode, all reads and writes go to the remote
node - you will probably get nearly the same performance as with
GNBD/iSCSI (your DD tests 4 and 5).
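For question 1, the "sleep on the second server" idea in its simplest
form - this is only a sketch, not the handler I posted; the hostname and
script locations are placeholders:

  #!/bin/bash
  # fence-peer wrapper: give server #1 a head start in the fencing race
  if [ "$(uname -n)" = "server2.example.com" ]; then
          sleep 30
  fi
  exec /sbin/obliterate-peer.sh "$@"

Point 'fence-peer' in drbd.conf at this wrapper instead of calling
obliterate-peer.sh directly.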
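For question 2, switching the elevator is just (the device name is an
example - use whatever disk backs your DRBD volume):

  cat /sys/block/sda/queue/scheduler          # the active one is in []
  echo deadline > /sys/block/sda/queue/scheduler

  # or globally at boot, by appending to the kernel command line:
  #   elevator=deadline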
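For question 3, the knobs I mean are roughly these - the values are
starting points for your own testing, not recommendations:

  # /etc/iscsi/iscsid.conf (open-iscsi)
  node.session.cmds_max = 1024
  node.session.queue_depth = 128
  node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144

  # /etc/multipath.conf - requests sent down a path before switching
  defaults {
          rr_min_io 100
  }

  # jumbo frames: MTU=9000 on the storage interfaces and the switch ports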