[DRBD-user] BUG: High I/O Wait and reduced throughput due to wrong I/O size on the secondary

Felix Zachlod fz.lists at sis-gmbh.info
Mon Mar 18 15:58:09 CET 2013

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello Lars, thanks for your reply.

> > incompatibility with the raid card driver?!
> 
> Did you tell us which driver that would be?

Of course. These are LSI MegaRAID SAS cards of type 9280 (8e and 4i4e),
two of them in each node.

# modinfo megaraid_sas

description:    LSI MegaRAID SAS Driver
author:         megaraidlinux at lsi.com
version:        00.00.06.12-rc1

This is on the primary,

and this is on the secondary:

description:    LSI MegaRAID SAS Driver
author:         megaraidlinux at lsi.com
version:        00.00.06.12-rc1

> What does your device stack look like?

It looks like the following:

SATA/SAS hard disks (we are running some RAID sets with SAS and some with
SATA) => LSI 9280 (RAID 5/6) => DRBD (on top of a GPT partition or on top
of a complete logical drive) => lvm2 => SCST
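
For completeness, the stacking on each node can be inspected with the
standard tools (lsblk only if your util-linux is new enough to include it):

# dmsetup ls --tree
# lsblk

dmsetup ls --tree shows the device-mapper part of the stack, lsblk the
whole block device tree including DRBD and the LVs.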

> DRBD does no request merging.

That is interesting. I am not sure I am interpreting the iostat values
correctly. I assumed that the tps shown for a device reflects how many I/Os
it has to handle, and since DRBD shows far more tps on the primary than its
backing disk does, I assumed DRBD was doing the merging.
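
To make sure I am reading them right, this is roughly how I compare the two
(drbd0 and sda stand in for the real device names):

# iostat -x 1 drbd0 sda

avgrq-sz is in 512-byte sectors, so pure page cache I/O shows up as ~8 on
drbd0. If sda reports a much larger avgrq-sz at a lower tps, the merging
happens below DRBD; the wrqm/s column counts the write merges directly.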

> If coming from the page (or buffer) cache, IO is simply 4k.
> (That's all normal file IO, unless using direct io).
> Those are the requests that DRBD sees in its make_request function.
> That's just the way it is.

Yes, we were talking about file I/O here. I already found out that block I/O
(or O_DIRECT) issues larger I/Os and performs much better in my setup. We
were using file I/O for our SCSI target here because we wanted to use our
RAM for read caching.
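
For the record, this is the kind of quick test that showed me the difference
(destructive, so only against a scratch DRBD device; drbd0 is an example
name):

# dd if=/dev/zero of=/dev/drbd0 bs=1M count=1024
# dd if=/dev/zero of=/dev/drbd0 bs=1M count=1024 oflag=direct

The first write goes through the page cache and reaches DRBD as 4k requests;
the second bypasses the cache, so DRBD sees much larger requests directly.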

> It is the lower level device's IO scheduler ("elevator") that does the 
> aggregation/merging of these requests before submitting to the backend 
> "real" device.

Mmmh okay... but why is it merging requests on the primary and not for the
same (replicated) requests on the secondary? That sounds strange to me. I
tried accessing the block device on the secondary directly too (without
O_DIRECT) and I don't see any 4k I/O there.
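
Something like this on the secondary's backing device should at least show
whether its elevator is active and merging anything at all (sda is an
example name; in /sys/block/sda/stat field 2 counts read merges and field 6
write merges):

# cat /sys/block/sda/queue/scheduler
# cat /sys/block/sda/stat

If the merge counters barely move while the DRBD receiver is writing, the
requests really do reach the disk unmerged.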

> > BUT lvm2 itself
> 
> No, I don't think device mapper does such thing.
> Well, "request based" device mapper targets will do that.
> not sure if you actually use those, though. Do you?

Mh, I must admit I don't know whether it operates request-based. It is just
the standard lvm2 package from the Debian Squeeze repos.
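
If I understand it right, the target types can be checked directly; an LV
from plain lvm2 should show up as bio-based "linear" (or "striped") targets,
while "multipath" is the usual request-based one:

# dmsetup table

Each output line names the target type right after the start/length pair.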

> But obviously *something* is different now, which allows the lower 
> level IO scheduler to merge things.

Yes, of course, SOMETHING is different, but I can't tell what from all my
tests. As I said, interestingly it does not matter which of the servers is
primary for a DRBD device. I tried swapping the roles, and the primary
ALWAYS merges, the secondary NEVER, so I must assume that the megaraid_sas
driver is capable of doing this on both nodes. (In my test setup the primary
merges as well as the secondary does; I cannot reproduce the problem there.)
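
In case someone has an idea: the next thing I would diff between the two
nodes are the queue limits of the backing devices (sda again stands for the
real device), since a difference there could explain why one elevator merges
and the other does not:

# grep . /sys/block/sda/queue/scheduler \
         /sys/block/sda/queue/max_sectors_kb \
         /sys/block/sda/queue/nr_requests

grep . simply prints each value prefixed with its file name, which makes the
nodes easy to compare side by side.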

Regards, Felix
