[DRBD-user] Performance regression with DRBD 8.3.12 and newer

Mon Jun 11 22:54:58 CEST 2012

On 06/11/2012 04:31 PM, Florian Haas wrote:
> On 06/11/12 22:14, Matthias Hensler wrote:
>> On Mon, Jun 11, 2012 at 06:35:18PM +0200, Matthias Hensler wrote:
>>> [...]
>>> I checked the changelog for 8.3.12, but nothing obviously struck me.
>>> Also, diffing the source trees 8.3.11->8.3.12, I did not find anything
>>> obvious.
>>
>> Let me follow up on this myself. As suggested on IRC, I tried to build
>> drbd from source, just to take the elrepo packages out of the equation.
>>
>> So I started with DRBD 8.3.13, and as expected performance was low.
>>
>> Then I tried 8.3.11, and performance was low as well (although 8.3.11
>> from elrepo worked fine).
>>
>> That left me puzzled for a while, until I examined the elrepo packages
>> more closely. As it turned out, all working drbd versions were built on
>> 2.6.32-71, while all broken versions were built on 2.6.32-220.
>>
>>
>> So, I installed the old el6 2.6.32-71 kernel (took me a while to find
>> it, since it was removed from nearly all archives) and its devel
>> package, booted into that kernel and built two new versions from source:
>> 8.3.11 and 8.3.13. Then I booted back to 2.6.32-220.
>>
>> First try with my self-compiled 8.3.11 modules: everything is fine.
>> Second try with my self-compiled 8.3.13 modules: still everything is
>> fine.
>>
>> Indeed, the problem lies in the kernel version used to build the
>> drbd.ko module. I double-checked by using all userland tools from the
>> 8.3.13 elrepo build together with my drbd.ko built on 2.6.32-71 (but run
>> on 2.6.32-220).
>>
>> Just to be clear: all tests were made with kernel 2.6.32-220, and the
>> userland version does not matter.
>>
>> drbd.ko              | 8.3.11 | 8.3.13
>> ---------------------+--------+-------
>> built on 2.6.32-71   | good   | good
>> built on 2.6.32-220  | bad    | bad
>>
>>
>> So, how to debug this further? I would suspect looking at the symbols of
>> both modules might give a clue?
>
> As a knee-jerk response based on a hunch -- you've been warned :) --,
> this could be related to the BIO_RW_BARRIER vs. FLUSH/FUA dance that the
> RHEL 6 kernel has been doing between the initial RHEL 6 release, and
> more recent updates (when they've been backporting the "let's kill
> barriers" upstream changes from post-2.6.32).
>
> Try configuring your disk section with no-disk-barrier, no-disk-flushes
> and no-md-flushes (in both configurations) and see if your kernel module
> change still makes a difference.
>
> Of course, in production you should only use those options if you have
> no volatile caches involved in the I/O path.
>
> Not sure if this is useful, but I sure hope it is. :)
>
> Cheers,
> Florian
>

Oh! Please let me know if this works. :)
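
For anyone following along, a minimal sketch of what Florian's suggested
test might look like in a DRBD 8.3 style configuration. The resource name
"r0" is only a placeholder, and the rest of the resource definition is
assumed to stay exactly as it already is on both nodes:

```
# Test-only sketch for DRBD 8.3.x; "r0" is a hypothetical resource name.
resource r0 {
  disk {
    no-disk-barrier;   # do not use barriers for write ordering on the backing device
    no-disk-flushes;   # do not issue flushes to the backing device
    no-md-flushes;     # do not issue flushes to the meta-data device
  }
  # ... existing net/syncer/on sections unchanged ...
}
```

After changing the file on both nodes, something like "drbdadm adjust r0"
should apply the new disk options. As Florian says, this is only safe to
keep in production if there are no volatile caches in the I/O path.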

digimer

-- 
Digimer
Papers and Projects: https://alteeve.com