[DRBD-user] Performance regression with DRBD 8.3.12 and newer

Florian Haas florian at hastexo.com
Mon Jun 11 22:31:16 CEST 2012



On 06/11/12 22:14, Matthias Hensler wrote:
> On Mon, Jun 11, 2012 at 06:35:18PM +0200, Matthias Hensler wrote:
>> [...]
>> I checked the changelog for 8.3.12, but nothing obvious struck me.
>> Also, diffing the source trees 8.3.11->8.3.12, I did not find
>> anything obvious.
> 
> Let me follow up on this myself. As suggested on IRC, I tried to build
> drbd from source, just to take the elrepo packages out of the equation.
> 
> So I started with DRBD 8.3.13, and as expected the performance was low.
> 
> Then I tried 8.3.11, and performance was also low (although 8.3.11
> from elrepo worked fine).
> 
> That left me puzzled for a while, so I examined the elrepo packages
> more closely. As it turned out, all working drbd versions were built on
> 2.6.32-71, while all broken versions were built on 2.6.32-220.
> 
> 
> So, I installed the old el6 2.6.32-71 kernel (it took me a while to
> find it, since it had been removed from nearly all archives) and its
> devel package, booted into that kernel, and built two new versions from
> source: 8.3.11 and 8.3.13. Then I booted back into 2.6.32-220.
> 
> First try with my self-compiled 8.3.11 modules: everything is fine.
> Second try with my self-compiled 8.3.13 modules: still everything is
> fine.
> 
> Indeed, the problem lies in the kernel version used to build the
> drbd.ko module. I double-checked by using all userland tools from the
> 8.3.13 elrepo build together with my drbd.ko built on 2.6.32-71 (but
> run under 2.6.32-220).
> 
> Just to be clear: all tests were made with kernel 2.6.32-220, and the
> userland version does not matter.
> 
> drbd.ko              | 8.3.11 | 8.3.13
> ---------------------+--------+-------
> built on 2.6.32-71   | good   | good
> built on 2.6.32-220  | bad    | bad
> 
> 
> So, how to debug this further? I would suspect looking at the symbols of
> both modules might give a clue?
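(For what it's worth, a quick way to compare two module builds: check the vermagic string each was built against, and diff their unresolved symbols. The `build-71/` and `build-220/` paths below are placeholders for wherever the two drbd.ko files live.)

```shell
# Which kernel was each module built against? (vermagic string)
modinfo build-71/drbd.ko | grep vermagic
modinfo build-220/drbd.ko | grep vermagic

# Diff the unresolved (imported) kernel symbols of the two builds
nm -u build-71/drbd.ko | sort > syms-71.txt
nm -u build-220/drbd.ko | sort > syms-220.txt
diff syms-71.txt syms-220.txt
```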

As a knee-jerk response based on a hunch -- you've been warned :) --
this could be related to the BIO_RW_BARRIER vs. FLUSH/FUA dance that the
RHEL 6 kernel has been doing between the initial RHEL 6 release and
more recent updates (where they've been backporting the "let's kill
barriers" upstream changes from post-2.6.32).

Try configuring your disk section with no-disk-barrier, no-disk-flushes
and no-md-flushes (in both configurations) and see if your kernel module
change still makes a difference.
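For example (a sketch of the relevant disk section in DRBD 8.3 config
syntax; the rest of the resource definition is omitted):

```
disk {
  no-disk-barrier;
  no-disk-flushes;
  no-md-flushes;
}
```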

Of course, in production you should only use those options if you have
no volatile caches involved in the I/O path.

Not sure if this is useful, but I sure hope it is. :)

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now


