Hi all,

I am (still) trying to build a prototype high-availability cluster for Xen instances. One of the major issues is the shared storage that holds the Xen backend devices. The requirements for this storage are:

1. that it is shared (well, it's a cluster...)
2. that both nodes can access any of the backends at any time (although not necessarily concurrently)
3. that it doesn't need manual intervention

So the question is how to build such shared storage with drbd8 as the underlying mechanism. A few possibilities come to mind:

a) Having drbd8 in primary/primary mode with a cluster file system like ocfs2 on top of it and loopback file backends for the Xen domains. Been there, done that, wasn't stable at all (kernel oopses left and right) and very much violated requirement 3 due to split-braining on the drbd8 _and_ ocfs2 levels.

b) Having two drbd8 volumes, both in primary/secondary mode, with each node being primary for one of them. LVM2 on top of each volume splits it into LVs for those Xen domUs that are normally active on the respective node. Nice idea, but unfortunately it violates requirement 2, because each node can only run those domUs that are in its current primary set. That is, failover is all-or-nothing, and load balancing, dynamic migration, and extensibility are either impossible or severely limited.

c) Using drbd8 volumes as the Xen backend devices, on top of LVM2. That is, both cluster nodes hold one LV (or two, the second one for swap) for each Xen domU, and each LV is synced by its own drbd8 primary/secondary volume. Management effort is high, but can be taken care of with appropriate scripting. This beautifully fulfills all requirements in theory, but having run a setup like that in practice for about the past 6 months, I have to say that it is not stable either and thus violates requirement 3.
The problem here is not split-brain (each drbd volume only ever has one primary, the node that's currently running the domU), but more general and very hard to trace hiccups. The nodes tend to lock up, thrash extensively (without much user-space processing going on), and basically need a reboot (and manual intervention to get the drbd8 volumes to re-sync) about every two weeks. Nice in theory, but in practice it seems a no-go. The suspected reason is that drbd(8) just doesn't cope properly with around 30 or more active volumes. I tried several hints from this list (thanks to Lars, among others), including severely limiting sync speed and buffer sizes and serializing syncing with "after" statements. Although they improved matters, the system still isn't stable.

d) Using a single drbd8 volume in primary/primary mode with (C)LVM on top of it. Only one node will modify each LV at any given time, so the writes the two nodes make to the drbd8 volume never touch the same blocks.

The last variant _should_ be able to fulfill all requirements, but I am unsure whether drbd8 can avoid split-braining in this case. Is there any way to use the split-brain resolving mechanisms in a way that is both safe (i.e. does not overwrite changes) and automatic (i.e. never ends up in a split-brain that an admin needs to resolve manually)? Theoretically, this should be pretty simple:

- Even when network problems happen and the nodes go split-brain (on the "higher" level of domUs still executing independently), they will not modify the same blocks on the drbd8 volume (as they only use different LVs). Starting the same domU on both nodes in a drbd8 split-brain situation can be easily and cheaply avoided by using multiple heartbeat connections.
- When modifying shared LVM data (i.e. adding, removing, or resizing LVs), CLVM should take care that, again, only one host is modifying the LVM metadata.
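
To make the disjointness argument above concrete, here is a toy sketch in Python. It is only a model of the reasoning: the LV names, extents, and block numbers are made up, the helper functions are hypothetical, and nothing in it reflects how drbd8 actually tracks changes. It merely shows that if each node writes only to the LVs of the domUs it runs, the sets of blocks modified on the two sides do not intersect, and that they do intersect as soon as one domU runs on both nodes.

```python
# Toy model only: LV names, extents and block numbers are hypothetical,
# and nothing here reflects how drbd8 tracks changes internally.

def blocks_written(lv_extents, active_lvs):
    """Return the set of block numbers a node may touch, given the LVs it runs."""
    blocks = set()
    for lv in active_lvs:
        start, length = lv_extents[lv]
        blocks.update(range(start, start + length))
    return blocks


def conflicting_blocks(lv_extents, lvs_on_a, lvs_on_b):
    """Return the blocks that both nodes could have modified while disconnected."""
    return blocks_written(lv_extents, lvs_on_a) & blocks_written(lv_extents, lvs_on_b)


# Each LV is modelled as (first block, number of blocks) on the shared volume.
lv_extents = {"domU1": (0, 100), "domU2": (100, 50), "domU3": (150, 200)}

# Every domU runs on exactly one node: no block is written on both sides.
disjoint = conflicting_blocks(lv_extents, {"domU1", "domU2"}, {"domU3"})

# domU2 accidentally started on both nodes: its blocks are in conflict.
overlap = conflicting_blocks(lv_extents, {"domU1", "domU2"}, {"domU2", "domU3"})
```

Whether drbd8 can make use of such a property when resolving a split-brain is exactly what I am asking below; the sketch only states the property itself.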

So the real question after all the rambling (which, I hope, benefits readers by discussing some of the variants) is: how smart can drbd8 be made in terms of automatically resolving split-brain? Will it always go split-brain when _any_ blocks are changed on both sides, or only if the _same_ blocks are changed? The descriptions of the after-sb-* options in the drbd.conf manual page seem to indicate that automatically resolving split-brain is only possible on a volume level, not on a block level. Is there a way to make it work on block level, or am I missing something terribly obvious here?

best regards,
Rene

--
-------------------------------------------------
Gibraltar firewall       http://www.gibraltar.at/