[DRBD-user] Re: [OT] rsync issues [Was Re: Read performance?]

Wed May 30 21:35:29 CEST 2007

On Wednesday 30 May 2007 12:29:50 you wrote:

> You can see 2min, 20 sec. to create the file list of 283,000 files.
> You must have many millions of files if you're running out of time on
> that step.

Possibly. I'll let you know how many I actually end up having by the time this 
is done. A quick check on the IMAP server (which I haven't even started on) 
shows 258,967 files -- but only around 10 gigs used on the partition. (We 
love reiserfs!)

Keep in mind, there will be multiple backups there -- at least a week's worth, 
if not a month's worth, using all kinds of hardlink tricks. Which means, of course, 
that rsync has to go build up a list of everything to figure out what's a 
hardlink and what isn't. So, judging by what I've seen so far, I'm guessing 
around 300k files are being backed up -- multiply that by a week's worth of 
daily backups and a month's worth of "full" backups and that's around 10 
versions, so around 3 million files, at least.
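
To make the hardlink bookkeeping concrete, here is a minimal Python sketch of the general idea. It is not rsync's actual implementation, and the function name `find_hardlink_groups` is made up for illustration. It walks a tree and groups paths by (device, inode), which shows why the whole list has to be in memory at once: a hardlink can only be recognized once every path sharing that inode has been seen.

```python
import os
from collections import defaultdict

def find_hardlink_groups(root):
    """Return lists of paths under root that share an inode (hypothetical helper)."""
    by_inode = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: inspect the entry itself, not a symlink target
            if st.st_nlink > 1:  # only entries with extra links can be hardlinked
                by_inode[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes that were reached through more than one path in this tree.
    return [sorted(paths) for paths in by_inode.values() if len(paths) > 1]
```

With roughly 3 million paths, a table like `by_inode` is presumably the part that grows with the number of files rather than with the amount of data.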

It wasn't actually running out of time (it had all weekend), so much as 
running out of RAM. This was a few years ago, so it might be fine on the 1-2 
gigs of RAM we could just throw at it today, but it would make me nervous, 
considering how badly it died before. Also, even if it needed 4 gigs of RAM 
to do this, that means it only needs about 400 megs to do each individual 
backup (and next to nothing at all for the drbd sync).
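
Spelling out the back-of-the-envelope numbers, as a rough sketch only: the 300k files and 10 versions are my guesses from above, the 4 gigs is a hypothetical worst case, and the assumption that memory use scales linearly with the number of files is mine, not something I have measured.

```python
files_per_backup = 300_000   # rough guess from the file counts seen so far
versions = 10                # about a week of dailies plus a month of "full" backups
total_files = files_per_backup * versions

ram_total_mb = 4 * 1024      # hypothetical: 4 gigs needed for everything at once
# Assumption: memory scales linearly with the number of files listed.
ram_per_backup_mb = ram_total_mb / versions
bytes_per_file = ram_total_mb * 1024 * 1024 / total_files

print(f"{total_files:,} files in total")
print(f"about {ram_per_backup_mb:.0f} MB per individual backup")
print(f"about {bytes_per_file:.0f} bytes of RAM per file")
```

So handling one backup at a time would need on the order of 400 megs under these assumptions, which is the point of letting DRBD carry everything else.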

> FYI2: In theory rsync can survive a crash in the middle if you have
> the right parameters.  I use:
> rsync -avh --stats --links --partial-dir=/remote_backup_transfer_dir
> --timeout=1800 /local_backup_dir login@server:remote_backup_dir/
>
> The partial-dir and the timeout took some tweaking for me to figure out.
>
> The partial-dir says to leave failed transfers in the transfer dir for
> use on a future rsync call.  The timeout had to be long because I had
> some multi-gigabyte files fail in the middle and it was taking rsync a
> long time to restart their transfer on the next invocation.  I assume
> it was running checksums to verify the partial file it had from the
> previous run.
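
For reference, the quoted invocation could be wrapped in a retry loop along these lines. This is only a sketch under assumptions: the flags are the ones quoted above, while the helper names `build_rsync_cmd` and `run_with_retries` and the retry count are hypothetical. The idea is that each retry can pick up whatever the previous run left in the partial dir.

```python
import subprocess

def build_rsync_cmd(src, dest, partial_dir, timeout_s=1800):
    """Build the argument list for the quoted rsync call (hypothetical helper)."""
    return [
        "rsync", "-avh", "--stats", "--links",
        f"--partial-dir={partial_dir}",  # keep interrupted files for the next run
        f"--timeout={timeout_s}",        # I/O timeout in seconds
        src, dest,
    ]

def run_with_retries(cmd, attempts=3, runner=subprocess.call):
    """Run cmd until it exits 0; return the attempt number, or None if all fail."""
    for attempt in range(1, attempts + 1):
        if runner(cmd) == 0:
            return attempt
    return None
```

Usage would be something like `run_with_retries(build_rsync_cmd("/local_backup_dir", "login@server:remote_backup_dir/", "/remote_backup_transfer_dir"))`. The `runner` parameter exists only so the loop can be exercised without actually calling rsync.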

Ok, but the question is not whether it can survive a crash (where I simply 
reboot and tell it to keep going), but whether it can survive a more 
permanent failure. As in, crash, and the local source is unrecoverable.

You say partial-dir keeps failed transfers there. What about successful 
individual files, but an overall failed transfer? Also keep in mind, it's 
only a 70 gig partition for backup at each end, so not a lot of room for 
duplicate data.

In any case, neither issue is crucial -- DRBD is working for us, so long as no 
one needs to restore something while another backup is running. It just 
strikes me as amazingly stupid that DRBD doesn't appear to be multithreaded at 
all, and it's not hard to imagine situations where that would make it completely 
unusable.

For example, DRBD+OCFS2 on a pair of load-balancing, redundant webservers, 
connected over relatively low bandwidth / high latency / the Internet -- now 
try unpacking some large-ish new web package onto the shared partition. 
Suddenly, every website running off of these would be DOWN until the transfer 
was complete.

That's a purely hypothetical situation, but really, how hard would it be to at 
least have a reader thread and a writer thread? (Don't you already have that, 
anyway? Maybe better locking or something?)