[DRBD-user] drbd8 and 80+ 1TB mirrors/cluster, can it be done?

drbd at bobich.net
Wed May 28 18:36:32 CEST 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Wed, 28 May 2008, Tim Nufire wrote:

> This is very timely feedback, thanks to everyone that has taken the time to 
> respond in such detail :-)
>
>
>>> In fact - scratch that - the bottleneck will almost certainly be the 
>>> network device you will be doing mirroring (DRBD) over, even if you are 
>>> using multiple bonded Gb ethernet NICs. So the overhead of spending a bit 
>>> of CPU on RAID6 is certainly not going to be what will be holding you 
>>> back.
>
> Yes, network bandwidth is my limiting factor. Not only do I mirror the data 
> via DRBD, but the data that's being written is coming across the network to 
> begin with.

Split the replication/cluster traffic from the user/application traffic. 
That will help, especially while resyncing after one of the servers gets 
rebooted for whatever reason.
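
For illustration, a minimal sketch of what a dedicated replication link 
looks like in a DRBD 8 resource definition (hostnames, devices and 
addresses here are made up):

    resource r0 {
        protocol C;
        on node-a {
            device    /dev/drbd0;
            disk      /dev/sdb;          # local backing disk
            address   10.0.1.1:7788;     # IP on the dedicated replication NIC
            meta-disk internal;
        }
        on node-b {
            device    /dev/drbd0;
            disk      /dev/sdb;
            address   10.0.1.2:7788;
            meta-disk internal;
        }
    }

The user-facing traffic then stays on the other interface(s), so a resync 
doesn't starve your clients.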

>>> Please read the archives of linux-raid as to what is the recommended
>>> raid5 size (as in number of drives); it's definitely below 20.
>> 
> On _anything_, RAID5 generally makes sense up to about 20 disks at most, 
> because beyond that the failure rate means you need more redundancy than 
> RAID5 provides. RAID6 should just about cope with 80 disks, especially if 
> you are looking at mirroring the setup with DRBD (effectively giving you 
> RAID61).
>
> Ah, my misunderstanding... I was thinking of RAID5 and RAID6 in groups of 
> more like 5 disks which is why I thought the overhead was so high.

No, the problem is that as the number of disks grows, so does the 
probability that at least one of them is failed at any given moment (for 
n disks with per-disk failure probability p it is 1 - (1-p)^n), so n+1 
redundancy quickly becomes insufficient for reasonable fault tolerance 
(RAID6 being n+2).
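
To put rough numbers on it (assuming, say, a 3% annual failure rate per 
drive and independent failures): with 5 drives the chance of at least one 
failure in a year is 1 - 0.97^5, about 14%; with 80 drives it is 
1 - 0.97^80, over 91%. A single parity drive clearly doesn't stretch 
across 80 spindles.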

>> Resyncing will involve lots of pain anyway. Have you checked how long it 
>> takes to write a TB of data?? RAID6 will keep you going with 2 failed 
>> drives, and if you do it so that you have a RAID6 stripe of mirrors 
>> (RAID16), with each mirror being a DRBD device, it would give you pretty 
>> spectacular redundancy, because you would have to lose three complete 
>> mirror sets.
>
> Syncing 8 x 1TB drives across a Gb switch is going to take 14+ hours :-/

Exactly - so worrying about RAID performance/overheads is a tad academic 
compared to that.
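
The back-of-envelope figures, in case anyone wants to check: a single Gb 
link moves at best ~110-125MB/s of real throughput, so resyncing 8TB is 
roughly 8,000,000MB / 110MB/s = ~72,000s, i.e. 18-20 hours flat out - and 
that's with no application traffic competing for the wire. The 14+ hours 
above is, if anything, optimistic.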

> But I'm a bit confused here.... My original proposal was essentially 
> RAID1+JBOD with RAID1 provided by DRBD and JBOD by LVM. In this setup, a 
> single drive failure would be handled transparently by the DRBD driver 
> without the need for a cluster failover. Am I understanding this correctly?

Yes. But JBOD (simple concatenation) is as fragile as RAID0 - lose one 
member and the volume goes with it - without giving you RAID0's nicely 
scalable striping performance. So you might as well RAID6 the DRBD 
devices and have the best of both worlds, with seriously increased 
resilience on top.
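
As a sketch (device names are made up, and bash is assumed for the brace 
expansion), RAID6 over 40 DRBD mirror sets would be something like:

    mdadm --create /dev/md0 --level=6 --raid-devices=40 /dev/drbd{0..39}

With 40 x 1TB mirror sets that gives 38TB usable, and the array survives 
the complete loss of any two mirror sets.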

> I also don't see why I would need to sync more than a single drives worth of 
> data to recover from a failure..... Since drives are mirrored before being 
> combined into LVM volumes, data loss will only happen if I lose both sides of 
> a mirror.

Yes, but this is essentially RAID10, which may or may not be adequate with 
as many as 80 disks. LVM has uses, but since software (and most modern 
hardware) RAID can be grown, using it for JBOD seems a bit dubious. The 
sort of thing that it IS useful for is snapshotting, if you need such a 
thing.
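
If snapshots are what you want LVM for, it can sit on top of the RAID set 
rather than replace it. Something like this (volume names and sizes made 
up):

    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate --size 36T --name data vg0
    lvcreate --size 500G --snapshot --name data-snap /dev/vg0/data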

> That said, I'm intrigued by the idea of using RAID5 or RAID6 instead of LVM 
> to create my logical volumes on top of DRBD mirrors.... It adds a bit more 
> redundancy at a reasonable price. In addition, while a drive failure in my 
> setup would not cause a cluster failover, I think I would need to failover 
> the cluster to *replace* the bad drive or even re-mirror the DRBD set to 
> another drive. Is this correct?

Not if you can spin down the disk and remove it without dismantling the 
machine. Even if you lose 3 disks in one machine (which would normally 
kill RAID6), it would still keep going, because the 3 mirrors of those 
disks in the other machine (remember you are RAID6-ing DRBD mirrors) would 
still work (or at least one would hope they would still work).

> Am I correct in thinking that RAID5/6 would 
> solve this? With the added complexity of heartbeat running on top of all this 
> it will be interesting to see if I can get all this configured correctly ;-)

You would only need to fail over the cluster if one of the machines (e.g. 
CPU, motherboard) actually failed. Losing disks would be handled by 
RAID/DRBD.
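
With heartbeat v1 that boils down to a one-line haresources entry; a 
sketch, with the node name, resource, mount point and service IP all made 
up:

    # /etc/ha.d/haresources
    node-a drbddisk::r0 Filesystem::/dev/drbd0::/data::ext3 10.0.0.100

drbddisk promotes the DRBD resource to Primary on whichever node holds 
the service, heartbeat mounts the filesystem and brings up the IP, and 
disk failures underneath never trigger any of it.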

>>> http://www.addonics.com/products/raid_system/rack_overview.asp and
>>> http://www.addonics.com/products/raid_system/mst4.asp
>>> 
>> Those don't seem to wind up being all the much cheaper, given the
>> density (is your rack space "free"?) and the lack of hot-swap (or
>> at least swap w/o having to dismantle the box) ability.
>
> It's not that rack space is free, but rather that I'll run out of power in my 
> current colo before I run out of space. As a result, density is not my 
> primary goal. With the 4 drive enclosures my plan is to leave dead drives in 
> place until I have at least 2 failures in an enclosure at which point I can 
> pull/fix it as a single unit.

I don't see how you can afford that. If you start having failures in 
multiple enclosures, you will probably run out of redundancy before you 
have 2+ failures in a 4 drive box. Do the maths on the probability of
failures. I don't think even RAID 16 would give you enough leeway to be 
that slapdash with leaving failed drives in place.
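
A quick sanity check on that: once one drive in a 4-drive box has died, 
the chance that the _next_ failure lands in the same box is only 3/79, 
i.e. under 4% (assuming failures strike randomly and independently). 
Failed drives will scatter across enclosures rather than conveniently 
pairing up.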

> The 4U rack case I link to above is more expensive but after a few drive 
> failures I may decide it worth the price ;-)

Those few drive failures may end up being the few drives you really didn't 
want to lose. But either way, RAID16 will give you scope for losing _ANY_ 
5 drives out of the 80 without data loss, and potentially up to 42 drives 
out of 80 before any data loss. But you will need to keep a very careful 
eye on what drives you have lost, because you'll find that failures will 
be sufficiently distributed (assuming random failures) that you won't be 
able to wait for 2+ failures per enclosure.
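
That careful eye is easy enough to automate; a crude cron-able sketch 
(the alert address is obviously made up):

    #!/bin/sh
    # Alert if any md array is degraded (an '_' in the [UU...] status)
    if grep -q '\[U*_' /proc/mdstat; then
        mail -s "md degraded on `hostname`" admin@example.com < /proc/mdstat
    fi
    # Alert if any DRBD device's disk state is not clean
    if grep -q 'ds:.*\(Inconsistent\|Diskless\|Outdated\)' /proc/drbd; then
        mail -s "DRBD degraded on `hostname`" admin@example.com < /proc/drbd
    fi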

Most importantly, if one whole enclosure goes down, you lose 4 drives, and 
you are one failure away from total data loss. If this happens after you 
had 3 drives already dead that you left until another drive fails, you can 
kiss all 38TB of your data goodbye.

>> At least I'm not suggesting that you should get a "Thumper", but I'm sure
>> for some people that is the ideal and most economic solution.
>> (http://www.sun.com/servers/x64/x4500/)
>
> Wow, that's about $1.30 per GB of unformatted, non-redundant raw storage. I'm 
> currently at about $0.70/GB for formatted and mirrored storage in a "share 
> nothing" cluster :-) My goal is to get below $0.50/GB but I'll need a little 
> help from Moore's Law to get there. That or I figure out a way to get 
> high-availability and data redundancy without a RAID1 mirror in the mix. At 
> $200 for a 1TB SATA drive, I can't get below $0.25/GB ;-)

The only way you are likely to get there is by using more storage across 
more machines. If you scale it up, you could look into using something 
like Cleversafe to give you a price advantage through better redundancy 
distribution.
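
For what it's worth, the RAID16 scheme above isn't far off that target on 
drive cost alone: 80 x $200 = $16,000 for 38TB usable, which is about 
$0.42/GB before chassis, controllers and power are counted.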

>> Come again? I was suggesting an overhead of 2 drives, which comes to 2.5%
>> with 80 drives. Other than that RAID5 is free (md driver) and you sure
> were not holding back with CPU power in your specs (fewer, but faster 
>> cores and most likely Opteron instead of Intel would do better in
>> this scenario).
>
> The 'overkill' on CPU was an accident of Dell promotional pricing.... They 
> threw in the 2nd CPU for free ;-)

That's pretty irrelevant if these boxes are just dedicated SAN/NAS 
storage. The Gb NICs will bottleneck before the bus, which will bottleneck 
before the CPU.

>>> I'm using DRBD in part because it both
>>> replicates data and provides high-availability for the servers/
>>> services. I'll have some spare drives racked and powered so when
>>> drives go bad I can just re-mirror to a good drive leaving the dead
>>> device in the rack indefinitely.
>>> 
>> Er, if a drive in your proposed setup dies that volume becomes corrupted
>> and you will have to fail over to the other DRBD node and re-sync.
>
> I'm mirroring first and then using LVM to create JBOD volumes. So if one 
> drive fails, DRBD will just handle it transparently. A fail-over would be 
> required to fix the drive but even in that case the most I would need to 
> resync is a single drive. RAID5 and RAID6 are similar in the sense that a 
> failure will require resyncing the parity drives....

If you had reasonably accessible drives you wouldn't need to fail over 
even then. You could hot-swap the failed disks and rebuild online. The 
only time you would need to fail over is if the host machine itself 
failed.
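
With DRBD 8 that single-disk replacement goes roughly like this on the 
live node (resource name made up):

    drbdadm detach r0      # drop the dead backing disk; DRBD goes
                           # diskless and serves I/O from the peer
    # ...hot-swap the drive, repartition if necessary...
    drbdadm create-md r0   # write fresh metadata on the new disk
    drbdadm attach r0      # reattach; DRBD resyncs from the peer

No cluster failover, no unmount, and only that one disk's worth of data 
crosses the wire.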

Gordan


