[DRBD-user] Definitive answer on >4TB DRBD volumes

Tue Jul 22 16:14:19 CEST 2008

On Tue, Jul 22, 2008 at 6:01 PM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
> On Tue, Jul 22, 2008 at 05:03:23PM +0800, Patrick Coleman wrote:
>> Hi,
>>
>> I've googled around for a while, and I can't find anything definitive
>> for or against - what is the maximum volume size supported by DRBD?
>>
>> I'm running an 11TB DRBD 8.2.6 volume between two nodes, connected by
>> 10GE. I've hit some odd issues (OOPSes, continually resyncing data)
>> and I'd like to eliminate the volume size as a cause of the issue.
>
>
> DRBD 8.0.x, 8.2.6:
>  32bit kernel:
>        4 TB hard limit per device
>        you can have several of them, but you probably run into some
>        other limit pretty fast.
>  64bit kernel:
>        4 TB "supported".
>        (unsupported theoretically) 16 TB hard limit per device,
>        you can have several of them, but you probably run into some
>        other limit pretty fast.
<snip>
> Did that help?

mm, thanks for making that clear.

The boxes are both Dual-Quad-Core Xeons with 8GB of RAM, running a
Debian 2.6.22-amd64 kernel, so memory shouldn't be a problem. I'll
describe my current problems in more detail, and perhaps you'll be
able to tell me whether it seems to be related at all to the size of
the device (though it does sound likely, given you've had reports of
instability).

Firstly, DRBD seems to think it's permanently out of sync. I installed
8.0.12 (Debian testing) and ran the initial sync, and everything went
fine. Then I rebooted the secondary. After each reboot, it says about
3.9TB is out of sync and resyncs it. During the resync, the oos field
in /proc/drbd drops to zero. This completes, but then if I check
/proc/drbd the oos field is static at about 3.9TB, though the states
are UpToDate/UpToDate. Disconnecting and reconnecting triggers a
resync, but with the same result as after a reboot. Invalidating and
resyncing the secondary had no effect.

I then upgraded to 8.2.6, compiled from the Debian source package.
This all worked ok, and a resync happened as expected.

I tried blowing away the secondary and rebuilding it from scratch.
This seemed to work ok, and started doing the initial sync, but
crashed the secondary towards the end. After rebooting, it went back
to its resyncing 3.9TB thing.

I didn't trust the data on the secondary at this point, so I tried the
new verify feature from the primary. Going by the logs on the
secondary, this ran through to the end and found the 3.9TB OOS, but
crashed the primary just after it had finished.

The primary then decided its own 3.9TB was out of sync, and resynced
from the secondary. It's currently doing the same thing it was doing
before, with the large oos value in /proc/drbd:

version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by phil at fat-tyre, 2008-05-30 12:59:17
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:1000728252 dw:2122151308 dr:2008 al:1 bm:259638 lo:0 pe:0 ua:0 ap:0 oos:3128329444
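
For anyone wanting to track the oos value over time rather than eyeball
it, here is a minimal Python sketch that pulls the counters out of the
status text and converts oos to a readable size. It assumes the oos
counter is reported in KiB, which is how the DRBD documentation
describes it; the function names (parse_counters, kib_to_tib) are made
up for this example and are not part of DRBD.

```python
import re

# Counter lines copied from the /proc/drbd output quoted above.
SAMPLE = """\
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:1000728252 dw:2122151308 dr:2008 al:1 bm:259638 lo:0 pe:0 ua:0 ap:0 oos:3128329444
"""


def parse_counters(text):
    """Return the numeric key:value counters (ns, nr, ..., oos) as a dict.

    Only lowercase keys followed by digits are matched, so state fields
    such as cs:Connected or ds:UpToDate/UpToDate are skipped.
    """
    return {key: int(value)
            for key, value in re.findall(r"\b([a-z]+):(\d+)\b", text)}


def kib_to_tib(kib):
    """Convert KiB to TiB (assumption: the counter is in KiB)."""
    return kib / 2**30


if __name__ == "__main__":
    counters = parse_counters(SAMPLE)
    oos_kib = counters["oos"]
    print(f"oos: {oos_kib} KiB = {kib_to_tib(oos_kib):.2f} TiB")
```

On a live node the same functions could be fed the contents of
/proc/drbd instead of SAMPLE. If the KiB assumption holds, the value
above works out to roughly 2.9 TiB (about 3.2 TB in decimal units).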

I was considering moving to 4x3TB DRBD volumes this weekend and seeing
if that helps, but from what you say it might not make much
difference. If you think this is worth trying then I'll give it a go
anyway.

The issue is that I don't know whether the instability is caused by
DRBD or something else in the system (the two boxes are mostly
identical).
By the time I get to the box the terminal has blanked itself, so I
can't see the backtrace. There's nothing in the logs. It may be worth
connecting up a serial console, but I've only had two crashes in as
many months so it's going to be a while before I get anything.

One other thing I've noticed is that the machines started crashing
when I upgraded - would downgrading help?

Any suggestions you have at all would be most welcome.

Cheers,

Patrick

-- 
http://www.labyrinthdata.net.au - WA Backup, Web and VPS Hosting