[Drbd-dev] DRBD small synchronous writes performance improvements

Lars Ellenberg lars.ellenberg at linbit.com
Thu Apr 29 23:26:00 CEST 2010

On Thu, Apr 29, 2010 at 04:00:50PM -0400, Guzovsky, Eduard wrote:
> Hi guys,
> We analyzed DRBD performance of small synchronous write operations on
> systems with RAID controllers.  This i/o pattern happens frequently in
> data base transaction processing workloads. Large RAID caches ensure
> that disk i/o overhead is small - about 100us per 32KB block - and
> network overhead turns into a dominant factor.  In addition to network
> stack processing time, network overhead has two large components
> 1. "Wire time" - the actual time it takes to transmit data. On 1Gig
> network it takes about 270us to transmit 32KB block, on 10Gig - about
> 27us.
> 2. "NIC latency" - the time it takes NIC to start transmitting a packet
> on the sending node plus the time on the receiving node between the
> packet reception by the NIC and delivering the packet to the driver. NIC
> latency is chip specific and depends on "packet/interrupt coalescing"
> setting configurable via "ethtool -C ". The default setting results in
> about a 100us to 125us "short packet round trip" latency.  We reduced it
> to about 50us by effectively disabling packet coalescing.  Have you or
> your customers experimented with this? We are obviously concerned with
> adverse side effects of disabling packet coalescing on whole system
> performance during high network loads.
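We have no systematic numbers on that either.  For reference, the same
coalescing knobs that "ethtool -C" exposes can be driven programmatically
through the SIOCETHTOOL ioctl; a rough sketch only (needs CAP_NET_ADMIN
and a real interface name, and the helper name below is made up):

```c
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Hypothetical helper: programmatic equivalent of
 * "ethtool -C <ifname> rx-usecs 0 rx-frames 1". */
static int disable_rx_coalescing(const char *ifname)
{
	struct ethtool_coalesce ec;
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&ec, 0, sizeof(ec));
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&ec;

	ec.cmd = ETHTOOL_GCOALESCE;		/* read current settings */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		goto fail;

	ec.cmd = ETHTOOL_SCOALESCE;		/* write them back, modified */
	ec.rx_coalesce_usecs = 0;		/* no interrupt delay */
	ec.rx_max_coalesced_frames = 1;		/* interrupt per frame */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		goto fail;

	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}
```

Whether a driver honors both fields is NIC specific, same caveat as with
the ethtool command line.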
> We found in our testing that certain DRBD changes - see the patch below
> - improve performance of small synchronous writes.  For the reference,
> the testing was done in Xen environment on Dell T610 system with Xeon
> E5520 2.27 GHz CPU, and Dell PERC 6/I RAID controller.  DRBD code was
> running in the 2 way SMP Dom0.
> The patch is made against DRBD version 8.2.7, but it is equally relevant
> to 8.3.7 as corresponding parts of the code did not change
> significantly. This patch is a "rough draft request for comment". It
> contains several changes.
> 1. TCP_QUICKACK option is set incorrectly. The goal was to force TCP to
> send an ACK as a "one time" event.  Instead, the code permanently puts
> the connection in QUICKACK mode.

Oh, it is not permanent, tcp will re-enable "pingpong" mode
when it "feels like it".  But you are right: using val = 2 will
re-enter pingpong mode immediately once any pending ACKs
have actually been forced out.

> 2. Socket sndbufsize/rcvbufsize setting is done incorrectly. The code
> sets socket buffer sizes _after_ the connection is established.  For
> these settings to take effect they must be set _before_ the connection
> is established.  We made a quick and dirty change that applies identical
> settings to both the meta and data connections.  A bigger change would
> be required to support separate settings, because in the current code it
> is not known in advance which socket will be used for which connection.

Apparently I need to re-read some kernel code on this.
Could you point me to the specific area of code?

> 3. We noticed that on the primary node it takes about 20us to schedule
> DRBD worker thread that packages and sends write request to the
> secondary node. We think it would be better to send request to the
> secondary ASAP and only then continue with primary node processing. So I
> added "schedule()"  hack to drbd_make_request_common() and raised the
> priority of the worker thread. That reduced worker thread scheduling
> delay to about 7us. I am not 100% sure this hack is safe - I would be
> very interested in your opinion on it.

That's an interesting hack ;-)
What priority did you choose?
What is your "cpu-mask" for the drbd threads?

> 4. We disabled TCP_CORK through the drbdsetup utility and modified the
> code to do implicit corking using the MSG_MORE flag. TCP tries to
> postpone sending a partial message until the whole message is assembled.
> So we try to send the drbd request header first, to let the secondary
> node start preparing to receive the data part while the primary node is
> still transmitting the data. Maybe this behavior should be a
> configurable variant of tcp corking, because it might not be
> advantageous for every NIC/link speed configuration.

We'll see what this does to our test hardware.  Anyways, if it seems to
be beneficial for you, we can certainly add some config option for it.

> We were also considering implementation of "zero copy" receive that
> should improve performance for 10Gig links - that is not part of the
> patch. The basic idea is to intercept incoming data before they get
> queued to the socket, in the same way the NFS and iSCSI code do it
> through the tcp_read_sock() API.

Yes. I recently suggested that as an enhancement for IET on the
iscsit-target mailing list myself, though there to get rid of an
additional memcpy they do for their "fileio" mode.

I don't think it is that easy to adapt for DRBD (or their "block io"
mode), because:

> Then drbd could convert sk_buff chain to bio, if alignment is right,

that is a big iff.
I'm not at all sure how you want to achieve this.
Usually the alignment will be just wrong.

> and avoid an expensive data copy. Do you plan to add something like this
> to a future DRBD release, so we do not have to do it ourselves? ;-)

Something like this should be very beneficial, but I don't see how we
can achieve the proper alignment of the data pages in the sk_buff.

"native RDMA mode" for DRBD would be a nice thing to have, and possibly
solve this as well.  Maybe we find a feature sponsor for that ;-)
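For reference, the tcp_read_sock() hook the iSCSI target uses would look
roughly like this transplanted to DRBD; in-kernel sketch only, all drbd_*
names are hypothetical, and the alignment check in the actor is exactly
the part I don't see how to satisfy in general:

```c
/* Kernel-side sketch (not runnable userspace code).  tcp_read_sock()
 * hands each sk_buff to a "recv actor" before it would be copied out
 * via the normal socket receive path. */
static int drbd_recv_actor(read_descriptor_t *desc, struct sk_buff *skb,
			   unsigned int offset, size_t len)
{
	size_t used = min_t(size_t, len, desc->count);

	/* Here one would walk skb_shinfo(skb)->frags and check whether
	 * each fragment is page-aligned and page-sized; only then could
	 * the page be referenced into a bio instead of memcpy'd. */

	desc->count -= used;
	return used;		/* number of bytes consumed */
}

static void drbd_tcp_data_ready(struct sock *sk, int bytes)
{
	read_descriptor_t desc = {
		.arg.data = sk->sk_user_data,	/* per-connection state */
		.count    = bytes,
	};

	tcp_read_sock(sk, &desc, drbd_recv_actor);
}
```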

> We would appreciate your comments on the patch.

Will have to do actual review within the next few days.


>  	ok = (sizeof(p) ==
> -		drbd_send(mdev, mdev->data.socket, &p, sizeof(p), MSG_MORE));
> +		drbd_send(mdev, mdev->data.socket, &p, sizeof(p), 0));

> @@ -2234,7 +2241,7 @@
> -			ok = _drbd_send_zc_bio(mdev, req->master_bio);
> +			ok = _drbd_send_zc_bio(mdev, req->master_bio, MSG_MORE);

Ok, I see where you are going.
Maybe rather than passing the flags into _drbd_send_zc_bio, have
_drbd_send_zc_bio itself add MSG_MORE to all but the last sendpage?
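Something along these lines (untested in-kernel sketch against 8.2-ish
code; assumes _drbd_send_page() is extended to take a flags argument):

```c
/* Sketch: _drbd_send_zc_bio() decides per segment, setting MSG_MORE on
 * every page of the bio except the last one, so callers no longer need
 * to pass flags in. */
static int _drbd_send_zc_bio(struct drbd_conf *mdev, struct bio *bio)
{
	struct bio_vec *bvec;
	int i;

	__bio_for_each_segment(bvec, bio, i, 0) {
		unsigned int flags = (i == bio->bi_vcnt - 1) ? 0 : MSG_MORE;

		if (!_drbd_send_page(mdev, bvec->bv_page, bvec->bv_offset,
				     bvec->bv_len, flags))
			return 0;
	}
	return 1;
}
```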

> @@ -2281,7 +2288,7 @@
> -		ok = _drbd_send_zc_bio(mdev, e->private_bio);
> +		ok = _drbd_send_zc_bio(mdev, e->private_bio, 0);

no MSG_MORE here?

> +
> +     /* give a worker thread a chance to pick up the request */
> +	if (remote) {
> +            if (!in_atomic())
> +                    schedule();

You may well drop the if (!in_atomic()),
it cannot possibly be in atomic context there.
Also, the immediately preceding spin_unlock_irq() is a pre-emption
point.  So actually this should not even be necessary.

> +     }
> +
>  	kfree(b); /* if someone else has beaten us to it... */
>  	if (local) {
> diff -aur src.orig/drbd/drbd_worker.c src/drbd/drbd_worker.c
> --- src.orig/drbd/drbd_worker.c	2010-04-01 15:47:54.000000000 -0400
> +++ src/drbd/drbd_worker.c	2010-04-26 18:25:17.000000000 -0400
> @@ -1237,6 +1237,9 @@
>  	sprintf(current->comm, "drbd%d_worker", mdev_to_minor(mdev));
> +	current->policy = SCHED_RR;  /* Make this a realtime task! */
> +	current->rt_priority = 2;    /* more important than all other tasks */
> +

Not sure about this.
I don't really want to do crypto operations
from a real time kernel thread.
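If we do end up raising the priority, it should at least go through the
scheduler API instead of poking the task struct fields directly; roughly
(in-kernel, untested sketch, helper name made up):

```c
#include <linux/sched.h>

/* Hypothetical helper: raise the worker's priority the supported way.
 * Writing current->policy / current->rt_priority directly bypasses the
 * scheduler's runqueue bookkeeping. */
static void drbd_worker_set_rt(void)
{
	struct sched_param param = { .sched_priority = 2 };

	sched_setscheduler(current, SCHED_RR, &param);
}
```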

: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
