[DRBD-user] Layering drbd8 and LVM2 to minimize split-brain

Rene Mayrhofer rene.mayrhofer at gibraltar.at
Sun Oct 14 19:52:15 CEST 2007



Hi all,

I am (still) trying to build a prototype high-availability cluster for Xen 
instances. One of the various major issues is the shared storage for
holding the Xen backend devices. Requirements for this storage are:
1. that it is shared (well, it's a cluster...)
2. that both nodes can access any of the backends at any time (although not 
necessarily concurrently)
3. that it doesn't need manual intervention

So the question is how to build such a shared storage with drbd8 as the 
underlying mechanism. A few possibilities come to mind:

a) Having drbd8 in primary/primary mode with a cluster file system like ocfs2 
on top of it and loopback file backends for the Xen domains. Been there, done 
that, wasn't stable at all (kernel oopses left and right) and very much 
violated requirement 3 due to split-braining on the drbd8 _and_ ocfs2 levels.

b) Having two drbd8 volumes, both in primary/secondary mode, with each node 
being primary for one of them. On top of the volumes using LVM2 to split into 
those Xen domUs that are normally active on the respective node. Nice idea, 
but unfortunately violates requirement 2, because each node can only run 
those domUs that are in its current primary set. That is, failover is 
all-or-nothing and load balancing/dynamic migration/extensibility are not 
possible or severely limited.

c) Using drbd8 volumes as the Xen backend devices, on top of LVM2. That is, 
both cluster nodes hold one (or two, the second one for swap) LV(s) for each 
Xen domU and those are synced with one drbd8 primary/secondary volume each. 
Management effort is high, but can be taken care of with appropriate 
scripting. This beautifully fulfills all requirements in theory, but having 
run a setup like that in practice for roughly the past six months, I have 
to say that this as well is not stable and thus violates requirement 3. 
The problem here is not split-brain (each drbd volume only ever has one 
primary, the one that's currently running the domU), but more general and 
very hard to trace hiccups. The nodes tend to lock up, thrash extensively 
(without much user-space processing going on), and basically need a reboot 
(and manual intervention to get the drbd8 volumes to re-sync) about every two 
weeks. In theory nice, in practice it seems a no-go. 
The suspected reason is that drbd(8) just doesn't cope properly with around 30 
or more active volumes. I tried several hints from this list (thanks to Lars, 
among others), including severely limiting the sync speed and buffer sizes 
and serializing resyncs with "after" statements, and although they improved 
matters, the system still isn't stable.
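For reference, the kind of tuning I mean looks roughly like this in drbd.conf
(a sketch for DRBD 8.0-style syntax; the resource names r0/r1 and the concrete
values are placeholders, not my actual config):

```
# Illustrative drbd.conf fragment -- resource names and values are examples.
resource r0 {
  syncer {
    rate 5M;            # severely limit resync bandwidth
  }
  net {
    sndbuf-size 512k;   # reduce the network send buffer
  }
}
resource r1 {
  syncer {
    rate 5M;
    after r0;           # serialize: resync r1 only after r0 has finished
  }
}
```

With ~30 resources, each one needs such an "after" chain entry, which is part
of the management overhead I mentioned.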

d) Using a single drbd8 volume in primary/primary mode with (C)LVM on top of 
it. Only one node will modify each LV at any given time, so writes to the 
drbd8 volume are disjoint at the block level. 
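If variant d) can work at all, I would expect the net section to look
something like the following (again only a sketch; the after-sb-* choices are
exactly the part I am unsure about, and whether they are safe here depends on
the answer to my question below):

```
# Illustrative drbd.conf fragment for primary/primary with automatic
# split-brain policies (DRBD 8 after-sb-* options).
resource r0 {
  net {
    allow-two-primaries;                 # required for primary/primary
    after-sb-0pri discard-zero-changes;  # no primaries: keep the side that changed
    after-sb-1pri discard-secondary;     # one primary: drop the secondary's changes
    after-sb-2pri disconnect;            # two primaries: give up, manual resolution
  }
}
```

The problem is that all of these policies operate on the whole volume, not on
individual blocks, which is what the question below is about.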

The last variant _should_ be able to fulfill all requirements, but I am unsure 
if drbd8 can avoid split-braining in this case. Is there any way to use the 
split-brain resolving mechanisms in a way that is both safe (i.e. to not 
overwrite changes) and automatic (i.e. to never go into a split-brain where 
an admin needs to manually resolve matters)? Theoretically, this should be 
pretty simple:
- Even when network problems happen and the nodes go split-brain (on 
the "higher" level of domUs still executing independently), they will not 
modify the same blocks on the drbd8 volume (as they only use different LVs). 
Starting the same domU on both nodes in a drbd8 split-brain situation can be 
easily and cheaply avoided by using multiple heartbeat connections.
- When modifying shared LVM data (i.e. adding, removing, or resizing LVs), 
CLVM should take care that, again, only one host is modifying the LVM 
metadata.
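For completeness, the CLVM side of this would just be the standard cluster
locking setup in lvm.conf (illustrative fragment; it assumes clvmd is running
on both nodes):

```
# /etc/lvm/lvm.conf fragment (illustrative)
global {
    # 3 = cluster-wide locking via clvmd, so only one node at a time
    # can modify the shared LVM metadata
    locking_type = 3
}
```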

So the real question after all the rambling (which, I hope, is to the benefit 
of readers to discuss some of the variants) is: how smart can drbd8 be made 
in terms of automatically resolving split-brain? Will it always go 
split-brain when _any_ blocks are changed on both sides, or only if the 
_same_ blocks are changed? The drbd.conf manual page descriptions of the 
after-sb-* options seem to indicate that automatically resolving split-brain 
is only possible on a volume but not on a block level. Is there a way to make 
it work on block level, or am I missing something terribly obvious here?

best regards,
Rene

-- 
-------------------------------------------------
Gibraltar firewall       http://www.gibraltar.at/

