[DRBD-user] OCFS2/GFS+DRBD for HUGE Partition 30TB.

Fri Sep 9 22:09:00 CEST 2011

Hi Robert,

On 09/09/11 17:42, Robert Krig wrote:
> 
> I'm currently in the process of implementing a new storage cluster for
> my employer.
> 
> We're looking at about 30TB of mirrored Storage on two storage servers.
> The intention is to expand this sometime in the near future, so it could
> be that in one or two years we're looking at 40-60TB of storage.
> 
> Our storage is meant for simple storage space of files which are in the
> range of 10-250MB on average.
> The idea is to run the DRBD set up in dual-primary mode, so that uploads
> to one or the other node are synchronised and the data set is always
> consistent. Right now we are more concerned with redundancy than
> load balancing. But I figure that a load balanced set up now will save
> us some headaches in the future, once capacities extend the capabilities
> of a single node.

I humbly suggest that's a questionable approach. The only thing that
would benefit from a shared cluster file system would be a parallelized
application accessing the file system concurrently, and I doubt you
actually have one of those. I dare say in your setup dual-Primary will
cause you more headaches than benefits.

In addition, "once capacities [exceed] the capabilities of a single
node", they're soon going to exceed those of another, and then what?
What I suggest is that you adopt an approach that encompasses multiple
nodes which you scale as needed.

> It would greatly simplify things if I could set up the 30TB as a single
> volume mounted as /storage on the server.

Yes, and you can do that in a much more scale-out capable approach that
still involves DRBD, and does not deal with dual-Primary configurations.
Actually, you get to choose from several; I'm mentioning two here:

1. A multiple-node filesystem such as Lustre, achieving both metadata
and object storage redundancy with DRBD. Now, Lustre isn't particularly
easy to set up and requires running a modified kernel, but it scales
well and to huge storage sizes.

2. A super simple approach where you just use NFS. Any client can mount
any number of NFS exports into a common hierarchy, and you can scale
your setup almost indefinitely.
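
To make option 2 a bit more concrete, here is a minimal sketch that
generates /etc/fstab lines placing several NFS exports under a single
/storage tree. The host names, export paths, mount point naming and
mount options are all made up for illustration; adapt them to whatever
your servers actually export.

```python
def fstab_lines(exports, root="/storage", options="rw,hard"):
    """Return one fstab line per (host, export_path) pair.

    Each export gets its own numbered directory below `root`, so the
    client sees all of them as one hierarchy.
    """
    lines = []
    for index, (host, export_path) in enumerate(exports, start=1):
        mount_point = f"{root}/vol{index:02d}"
        lines.append(f"{host}:{export_path} {mount_point} nfs {options} 0 0")
    return lines


# Hypothetical NFS servers, each one backed by its own DRBD pair.
EXPORTS = [
    ("nfs-a.example.com", "/export/data"),
    ("nfs-b.example.com", "/export/data"),
]

print("\n".join(fstab_lines(EXPORTS)))
```

Growing the setup then means appending another entry to the list, not
resizing an existing volume.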

> 
> Of course, here is where I'm running into some partition size limits on
> various ends.
> My intention was to use the OCFS2 filesystem. But here I ran into my
> first problem. If I use 4k blocks then the maximum partition size is
> 16TB. Unless I use the 64bit Journal option, OR I use a different blocksize.
> 
> So I have a couple of questions and observations before I decide for a
> definite path.
> 
> 1. First of all, am I on the right path? Is there perhaps a different
> approach I should be using?

See above.

> 2. Are there any drawbacks to creating a filesystem with let's say 1M
> blocks?

Well, having to access the block device in 1M chunks when your
average file size is 1k is not exactly a stellar idea, but you probably
guessed that already.
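
The arithmetic behind both points can be sketched as follows. This
assumes that the 16TB limit comes from 32-bit block numbers, and that
every file occupies a whole number of allocation units; both are
simplifying assumptions, not statements about OCFS2 internals.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 ** 4


def max_volume_bytes(block_size, block_number_bits=32):
    """Largest volume addressable with fixed-width block numbers."""
    return (2 ** block_number_bits) * block_size


def slack_bytes(file_size, unit_size):
    """Bytes left unused in the last allocation unit of a file."""
    remainder = file_size % unit_size
    return 0 if remainder == 0 else unit_size - remainder


# 32-bit block numbers with 4 KiB blocks
print(max_volume_bytes(4 * KIB) // TIB, "TiB")

# A 1 KiB file stored in a single 1 MiB allocation unit
print(slack_bytes(1 * KIB, 1 * MIB) // KIB, "KiB wasted")
```

The slack is always less than one allocation unit per file, so for
files in the 10-250MB range the overhead is much smaller in relative
terms than in the 1k case.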

> 3. I need to create everything in such a way, that I can seamlessly
> extend the storage later on. e.g. simply resize the 30TB volume, without
> having to reformat everything. Since, as you can imagine, shuffling
> around 30TBs of data takes forever, no matter how much bandwidth you have.

Yep, in the NFS or Lustre approach you just keep adding boxes as you grow.

> 4. Has anyone had experience with running a storage cluster of this
> size? e.g. over 16TB.

Yes; for example, iSCSI target clusters with DRBD exist that are well in
excess of 16TB storage.

> 5. Most tutorials I've seen on DRBD, suggest going with OCFS2.
> Is there any inherent advantage to OCFS2 or GFS?

Some users report a much more stable user experience on OCFS2, but, of
course, your mileage may vary.

> 6. Is a third "brain" node necessary for my setup?
> Each of our storage nodes serves data to one of two loadbalanced apache
> servers. As such the "loadbalancing" is kind of automatic, at least as
> far as web requests are concerned.
> 
> 7. Please post any further suggestions you might have.

Have you considered that in your 2-node scenario, two boxes pumping out
30TB worth of data are going to get clobbered pretty hard on the
network? How are you going to scale that? That's another consideration
to take into account which a multiple-node approach normally addresses
better.
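
For a rough sense of scale, the following sketch estimates how long it
takes to move 30TB over a single link. The link speeds are assumptions
picked for illustration, and the calculation ignores protocol overhead
and disk throughput, so real numbers will be worse.

```python
TB = 10 ** 12  # decimal terabyte, in bytes


def transfer_hours(data_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to push data_bytes through one network link.

    `efficiency` is the usable fraction of the nominal link speed.
    """
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


# Hypothetical link speeds; substitute whatever your network provides.
for name, speed in (("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10)):
    print(f"{name}: {transfer_hours(30 * TB, speed):.1f} hours")
```

Spreading the same data over more nodes divides the load per link,
which is the point of the multiple-node approach.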

Hope this helps.

Cheers,
Florian
