[DRBD-user] Drbd hang on write

Fri Jun 23 10:10:27 CEST 2006

/ 2006-06-22 15:24:58 -0400
\ Claude Pelletier:
> 
> Hi All,
> 
> I have 2 IBM I series servers running Linux.
> 
> drbd version 0.7.19 is installed on them.
> The 2 servers are connected over a 10mb line in a WAN configuration.
> 
> I would like to explain in detail what's happening and the way
> I see it.
> 
> 1 - The all_extents, protocol and snd_buffer parameters have been changed
> 2 - Take drbd down and back up on both sides (to make sure the changes have taken
> effect)
> 3 - Start a copy of a 300MB file to the partition /dev/drbd2
> 4 - The copy runs to completion
> 5 - About 30 sec to 1 minute after the copy finishes we can't access
> the /dev/drbd2 partition (through win samba or just doing an ls of the partition);
>      all the other drbd partitions and the system itself show no degradation.
> 6 - We see in cat /proc/drbd the bytes of this partition going from primary
> to secondary
> 7 - When the copy from primary to secondary is done, the partition /dev/drbd2
> becomes available again and performance is
>      back to normal on this partition (no other part of the linux system is
> affected by this)
> 
> 
> So here is what I see in all this.
> 
> It looks like drbd doesn't really do its copy from primary to secondary in the
> background.

there is no "background" or "foreground".
drbd is synchronous.

> My impression was that drbd would complete its copy in the background without
> slowing down the access to the fs on the primary machine.
> I really hope this is not a concept issue.

maybe a misconception on your side.
obviously drbd cannot write faster than either of your io subsystems,
nor faster than the replication network bandwidth allows.
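
As a rough illustration of the bandwidth limit, the sketch below computes the minimum time needed to push a file across the replication link. It assumes the "10mb" line in the report means 10 Mbit/s and ignores protocol overhead and latency; `min_transfer_seconds` is a hypothetical helper name, not part of drbd.

```shell
# Lower bound on the time to replicate a file over the link.
# Assumption: link speed is given in Mbit/s, file size in MB.
# usage: min_transfer_seconds SIZE_MB LINK_MBIT
min_transfer_seconds() {
    size_mb=$1
    link_mbit=$2
    # 1 MB is 8 Mbit; integer division rounds down.
    echo $(( size_mb * 8 / link_mbit ))
}

min_transfer_seconds 300 10   # prints 240 (the 300MB copy)
min_transfer_seconds 500 10   # prints 400 (the 500MB copy)
```

If that assumption holds, the 300MB file needs at least four minutes on the wire, which would be consistent with the partition staying busy well after cp has returned.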

what write rate do you observe? [*]
what is your raw network bandwidth?

[*] write rate: _including_ fsync. "time cp hugefile somewhere" does not
count, since cp does not do fsync (afaik). there are plenty of benchmark
tools out there, as a rough estimate something like
"sync;sync; time dd if=/dev/zero bs=1M  count=1024 of=blob ; time sync; time sync;"
could do...
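
A variant of the same idea, as a sketch only: GNU dd can flush the file itself via `conv=fsync`, so the time and rate it reports on exit already include the flush. This assumes GNU coreutils; `write_with_fsync` and the target path in the comment are hypothetical.

```shell
# Write COUNT_MB megabytes of zeros to FILE and flush them to disk.
# usage: write_with_fsync FILE COUNT_MB
write_with_fsync() {
    target=$1
    count_mb=$2
    # Flush unrelated dirty data first so it does not skew the result.
    sync
    # conv=fsync makes dd call fsync on the output file before exiting;
    # dd prints bytes copied, elapsed time and rate on stderr.
    dd if=/dev/zero of="$target" bs=1M count="$count_mb" conv=fsync
}

# Example (hypothetical mount point of the drbd device):
#   write_with_fsync /mnt/drbd2/blob 1024
```

Running it once with drbd connected and once on a plain local partition gives the two figures to compare.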

> If I copy, let's say, a 500MB file the same thing happens, except it happens even
> before the copy to the primary finishes,
> and it can even abort the copy.

well, smaller than that might fit in your local cache, and some time
later the file system decides to flush it.  larger than this, and it
needs to flush it to disk while the copy is still running.
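
One way to watch this happen, as a sketch assuming a Linux system with /proc mounted: the `Dirty` and `Writeback` lines in /proc/meminfo show how much written data is still sitting in the cache or is currently being flushed. `dirty_pages` is a hypothetical helper name.

```shell
# Show how much data is waiting in the page cache (Dirty)
# and how much is being written out right now (Writeback).
dirty_pages() {
    grep -E '^(Dirty|Writeback):' /proc/meminfo
}

dirty_pages
```

Calling it repeatedly during and after the copy should show `Dirty` growing while cp runs and draining only as fast as the slowest part of the write path allows.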

> I'm really surprised this didn't pop up before in the drbd forum.
> To me it's basic.

tell me your network bandwidth, disk throughput and observed write
rates with connected drbd, and we'll see what is basic.
maybe we can tune something. alas without knowing the hard limits and
the currently achieved figures, this is hard to tell.

> I really hope there is a parameter somewhere that would fix this.
> 
> The way I see it, it's really a drbd problem because the system itself still
> responds very well.
> 
> It's really the partition that hangs until it completes a copy from primary to
> secondary.
> 
> None of the other partitions under drbd hang.
> 
> To make all partitions hang I would just copy 3 files to each partition
> and I would be able to hang all the drbd partitions at once.

or change to a 2.4 kernel :)

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.