[DRBD-user] drbd8 and 80+ 1TB mirrors/cluster, can it be done?

Tim Nufire drbd-user_tim at ibink.com
Wed May 28 16:48:02 CEST 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


This is very timely feedback; thanks to everyone who has taken the
time to respond in such detail :-)


>> In fact - scratch that - the bottleneck will almost certainly be  
>> the network device you will be doing mirroring (DRBD) over, even if  
>> you are using multiple bonded Gb ethernet NICs. So the overhead of  
>> spending a bit of CPU on RAID6 is certainly not going to be what  
>> will be holding you back.

Yes, network bandwidth is my limiting factor. Not only do I mirror
the data via DRBD, but the data being written arrives over the network
to begin with, so each write effectively hits the network twice.
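
For anyone else fighting the same bottleneck, here's a rough sketch of
what I may try to widen the replication pipe by bonding two GigE ports
(interface names, addresses, and mode are placeholders; with one DRBD
resource per drive pair there are plenty of TCP connections for most
bonding modes to spread around):

    # Hypothetical: bond eth1+eth2 into bond0 for DRBD replication traffic
    modprobe bonding mode=balance-rr miimon=100
    ifconfig bond0 10.0.0.1 netmask 255.255.255.0 up
    ifenslave bond0 eth1 eth2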


>> Please read the archives of linux-raid as to what is the recommended
>> raid5 size (as in number of drives); it's definitely below 20.
>
> On _anything_ RAID5 generally makes sense up to about 20 disks,  
> because the failure rate will mean you need more redundancy than  
> that. RAID6 should just about cope with 80 disks, especially if you  
> are looking at mirroring the setup with DRBD (effectively giving you  
> RAID61).

Ah, my misunderstanding... I was thinking of RAID5 and RAID6 in
groups of more like 5 disks, which is why I thought the overhead was
so high.
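
Just to make the overhead arithmetic explicit (assuming one parity
disk per RAID5 group and two per RAID6 group):

    # parity overhead = parity disks / total disks in the group
    echo "scale=3; 1/5"  | bc   # 5-disk RAID5:  .200 -> 20% overhead
    echo "scale=3; 2/80" | bc   # 80-disk RAID6: .025 -> 2.5% overhead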


> Resyncing will involve lots of pain anyway. Have you checked how  
> long it takes to write a TB of data?? RAID6 will keep you going with  
> 2 failed drives, and if you do it so that you have a RAID6 stripe of  
> mirrors (RAID16), with each mirror being a DRBD device, it would  
> give you pretty spectacular redundancy, because you would have to  
> lose three complete mirror sets.

Syncing 8 x 1 TB drives across a GigE switch is going to take 14+ hours :-/
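
That's just back-of-envelope math, assuming a single GigE link at a
theoretical 125 MB/s with no protocol overhead:

    # 8 drives x 1 TB = 8e12 bytes over 1 Gbit/s (125 MB/s) at line rate
    echo "8 * 10^12 / (125 * 10^6) / 3600" | bc -l   # ~17.8 hours
    # real-world throughput is lower, so 14+ hours is if anything optimistic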

But I'm a bit confused here... My original proposal was essentially
RAID1+JBOD, with RAID1 provided by DRBD and JBOD by LVM. In this setup,
a single drive failure would be handled transparently by the DRBD
driver without the need for a cluster failover. Am I understanding
this correctly?
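
To make the layering concrete, here's roughly what I have in mind
(hypothetical device names; one DRBD resource per physical drive):

    # Mirror each physical drive with its own DRBD resource first,
    # then concatenate the mirrors into one JBOD-style volume with LVM
    pvcreate /dev/drbd0 /dev/drbd1 /dev/drbd2 /dev/drbd3
    vgcreate storage /dev/drbd0 /dev/drbd1 /dev/drbd2 /dev/drbd3
    lvcreate -l 100%FREE -n data storage
    mkfs.ext3 /dev/storage/data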

I also don't see why I would need to resync more than a single
drive's worth of data to recover from a failure... Since drives are
mirrored before being combined into LVM volumes, data loss will only
happen if I lose both sides of a mirror.

That said, I'm intrigued by the idea of using RAID5 or RAID6 instead
of LVM to create my logical volumes on top of the DRBD mirrors... It
adds a bit more redundancy at a reasonable price. In addition, while a
drive failure in my setup would not cause a cluster failover, I think
I would need to fail over the cluster to *replace* the bad drive or
even to re-mirror the DRBD set to another drive. Is this correct? Am I
correct in thinking that RAID5/6 would solve this? With the added
complexity of Heartbeat running on top of all this, it will be
interesting to see if I can get everything configured correctly ;-)
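
Something like this is what I'm picturing, if anyone wants to
sanity-check it (made-up device names; md RAID6 striped across the
DRBD mirrors instead of an LVM concatenation):

    # RAID6 across 8 DRBD-mirrored devices: a single drive failure is
    # absorbed by its mirror, and md parity survives the loss of up to
    # two complete mirror sets
    mdadm --create /dev/md0 --level=6 --raid-devices=8 \
          /dev/drbd0 /dev/drbd1 /dev/drbd2 /dev/drbd3 \
          /dev/drbd4 /dev/drbd5 /dev/drbd6 /dev/drbd7
    mkfs.ext3 /dev/md0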


>> http://www.addonics.com/products/raid_system/rack_overview.asp and
>> http://www.addonics.com/products/raid_system/mst4.asp
>>
> Those don't seem to wind up being all that much cheaper, given the
> density (is your rack space "free"?) and the lack of hot-swap (or
> at least swap w/o having to dismantle the box) ability.

It's not that rack space is free, but rather that I'll run out of
power in my current colo before I run out of space. As a result,
density is not my primary goal. With the 4-drive enclosures my plan is
to leave dead drives in place until I have at least 2 failures in an
enclosure, at which point I can pull/fix it as a single unit.

The 4U rack case I linked to above is more expensive, but after a few
drive failures I may decide it's worth the price ;-)


> At least I'm not suggesting that you should get a "Thumper", but I'm sure
> for some people that is the ideal and most economic solution.
> (http://www.sun.com/servers/x64/x4500/)

Wow, that's about $1.30/GB for unformatted, non-redundant raw
storage. I'm currently at about $0.70/GB for formatted, mirrored
storage in a "share nothing" cluster :-) My goal is to get below
$0.50/GB, but I'll need a little help from Moore's Law to get there.
That, or I figure out a way to get high availability and data
redundancy without a RAID1 mirror in the mix. At $200 for a 1 TB SATA
drive, I can't get below $0.25/GB ;-)
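
For anyone checking my math (treating 1 TB as 1000 GB and counting
drives only, ignoring chassis/power/servers):

    echo "scale=2; 200/1000"   | bc   # raw drive alone:  $0.20/GB
    echo "scale=2; 2*200/1000" | bc   # RAID1 drive pair: $0.40/GB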


> I'm always aiming for the cheapest possible solution as well, but
> this is always tempered by the reliability and how serviceable the
> result needs to be.

Agreed. And I'm not sure I've made the right trade-offs here... I've
got lots of cuts on my hands from working with the 4-drive bays, which
suggests I've saved a bit too much money :-/


> Come again? I was suggesting an overhead of 2 drives, which comes to
> 2.5% with 80 drives. Other than that RAID5 is free (md driver) and you
> sure were not holding back with CPU power in your specs (less, but
> faster cores and most likely Opteron instead of Intel would do better
> in this scenario).

The 'overkill' on CPU was an accident of Dell promotional pricing...
They threw in the 2nd CPU for free ;-)


>> I'm using DRBD in part because it both
>> replicates data and provides high-availability for the servers/
>> services. I'll have some spare drives racked and powered so when
>> drives go bad I can just re-mirror to a good drive leaving the dead
>> device in the rack indefinitely.
>>
> Er, if a drive in your proposed setup dies that volume becomes
> corrupted and you will have to fail over to the other DRBD node and
> re-sync.

I'm mirroring first and then using LVM to create JBOD volumes on top,
so if one drive fails, DRBD will just handle it transparently. A
failover would be required to fix the drive, but even in that case the
most I would need to resync is a single drive. RAID5 and RAID6 are
similar in the sense that a failure only requires rebuilding one
drive's worth of data from parity...
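
Concretely, every drive pair gets its own DRBD resource, something
like this sketch (hostnames, IPs, and ports are placeholders; each
resource needs its own port):

    # /etc/drbd.conf -- one resource per mirrored drive pair
    resource disk00 {
      protocol C;                  # synchronous replication
      on node-a {
        device    /dev/drbd0;
        disk      /dev/sdb;        # physical drive being mirrored
        address   10.0.0.1:7788;   # unique port per resource
        meta-disk internal;
      }
      on node-b {
        device    /dev/drbd0;
        disk      /dev/sdb;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }
    # disk01 would use /dev/drbd1 on port 7789, and so on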


>> Has anyone else tried to do something like this? How many drives can
>> DRBD handle? How much total storage? If I'm the first then I'm
>> guessing drive failures will be the least of my issues :-/
>>
> If you get this all worked out, drive failures and the ability to
> address them in an automatic and efficient manner will be the issue
> for most of the lifetime of this project. ^_-

Ah, so true.... :-/


Thanks again for all the great feedback!

Tim



