[DRBD-user] DRBD serious locking due to TOE - UPDATE

Tue Jan 8 18:04:49 CET 2008

(Second send, sorry, I used the wrong email again. :)

Lars,

Thanks for understanding my problem and finding the links.

After reading the links, it seems very likely that my problem is the
file system.  I have all the symptoms: a single large reiserfs file
system experiencing high bandwidth, and multiple CPUs.

The kernel patch to correctly order journal commits is in my 2.6.18.8
kernel.  Or at least it is in the source code SUSE supply with it :)

Have any other members experienced falling DRBD performance on a
similar setup?

Changing a file system is simple: copy the data off, mkfs, copy it back
on.  Can I change the file system on a Primary, Connected DRBD resource
without issue?

What do you all recommend for a MySQL system, ext2 or ext3?  Or any of
the others SUSE have given me:

mkfs.cramfs    mkfs.ext3      mkfs.minix     mkfs.ntfs      mkfs.vfat
mkfs.bfs       mkfs.ext2      mkfs.jfs       mkfs.msdos     mkfs.xfs

?

I shall let you know...

Ben

Lars Ellenberg wrote:
> On Tue, Jan 08, 2008 at 01:09:36PM +0000, Ben Clewett wrote:
>>
>> Hi Lars,
>>
>> Lars Ellenberg wrote:
>>> grrr. strip off one of the nots, please,
>>> either "not likely" or "likely not".
>>> anyways, it is NOT a problem in drbd.
>>> but you knew what I meant, right?
>> Many thanks, lots of good information to work with.
>>
>> I knew what you meant.  I don't believe it is a problem with DRBD.  But
>> it manifests through DRBD.  There is a known problem with my NIC, even
>> with the latest driver:
>>
>> http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1199792167647+28353475&threadId=1186722
>>
>> However the suggested fix here doesn't make a difference:
>>
>> # ethtool -K eth2 tso off
>>
>> I have a known problem which does not respond to the known fix!
>>
>> But I still suspect the NICs...  Tonight I am going to move them all
>> around to see if the problem follows the card.
>>
>> --------
>>
>> I also know that if I reboot the servers, I'll get a few days of grace 
>> before the locking returns.  I don't think that is DRBD related?
>>
>> The other thing I note is that slowing data input by introducing small
>> 1 second gaps every half minute or so actually increases the rate at
>> which data is accepted.  What could cause that effect?
> 
> don't start me speculating on that :)
> 
>> --------
>>
>> The analysis of /proc/drbd shows data to be moving.  But I can see a 
>> pattern:
>>
>> With my symmetrical system, on both servers:
>> # watch -n 0.1 -d cat /proc/drbd
>>
>> The DRBD resource which is not 'locked' is moving data continuously. 
>> lo:1 is the normal condition, ns and dw are increasing linearly.
>>
>> However, the DRBD resource which is locked moves data only sporadically.
>> It can spend up to ~2 seconds stuck on:
>> 	Primary:	bm:0 lo:n pe:0 ua:0 ap:n
>> 	Secondary:	bm:0 lo:0 pe:0 ua:0 ap:0
>> Where n is some number between 1 and ~ 50.
> 
> in that case, it is the _local io subsystem_ on the Primary,
> that does not return the requests.
> 	ap: application bios coming from upper layers (file system),
> 	    but not yet reported as completed to upper layers
> 	lo: submitted to local disk, but not completed from there yet.
> 	pe: sent to peer, but not yet acknowledged (pending peer)
> 
> we can only report completed to upper layers (decrease ap) when
> both corresponding lo and pe requests have completed.  since your pe is
> 0, but ap and lo are n, your local disk does not respond (quickly enough).
>  
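
The counter reading described above can be sketched in a few lines of
Python.  This is only a minimal sketch and not part of DRBD: the function
names parse_counters and likely_bottleneck are made up for illustration,
the classification rules are an assumption drawn solely from the
explanation above, and the sample lines reuse the counter layout quoted
earlier in this thread.

```python
import re

# Matches the lo/pe/ua/ap counters as they appear on a /proc/drbd line.
COUNTER_RE = re.compile(r"\b(lo|pe|ua|ap):(\d+)")


def parse_counters(line):
    """Return the lo/pe/ua/ap counters found on one status line as ints."""
    return {name: int(value) for name, value in COUNTER_RE.findall(line)}


def likely_bottleneck(counters):
    """Hypothetical heuristic: guess where outstanding requests are waiting.

    ap can only drop once both the local (lo) and the peer (pe) part of a
    request have completed, so whichever of the two is non-zero while ap
    is non-zero is the side still being waited for.
    """
    ap = counters.get("ap", 0)
    lo = counters.get("lo", 0)
    pe = counters.get("pe", 0)
    if ap == 0:
        return "idle"
    if lo > 0 and pe == 0:
        return "local disk"
    if pe > 0 and lo == 0:
        return "peer"
    if lo > 0 and pe > 0:
        return "local disk and peer"
    return "unclear"


if __name__ == "__main__":
    # Sample lines modelled on the stuck Primary and the idle Secondary.
    for sample in ("bm:0 lo:7 pe:0 ua:0 ap:7", "bm:0 lo:0 pe:0 ua:0 ap:0"):
        print(sample, "->", likely_bottleneck(parse_counters(sample)))
```

A single sample proves little; the pattern is only meaningful if the same
classification persists across many consecutive samples, as it did in the
"watch" output described above.
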
>> It looks like the traffic flow from one server is blocking the traffic
>> flow from the other, as if it has a higher priority, or is being
>> traffic shaped.  This might also explain why introducing 1 second gaps
>> helps get data moving: it stops one server, letting the other work.  But
>> it might also be coincidence...
>>
>> Can you think of anything in the kernel which might affect this?
> 
> file system waits for these n requests to complete before it sends more,
> maybe write after write dependency on some journal commit.
> 
>> ----
>>
>> Another idea, in case my resources are clashing for bandwidth: I have
>> two resources sharing a gigabit connection.  My rate = 100M.  If I
>> set this to 50M, would this ensure that each resource could not use more
>> than 50% of the connection?
> 
> the sync rate is only relevant during _resynchronization_,
> and is a throttle for the resync process to leave bandwidth for the
> normal, ongoing, _replication_ process.
> as long as drbd is cs:Connected, the "syncer rate" is irrelevant.
> 
>> ----
>>
>> You also asked about mounting:
>>
>> # cat /proc/mounts | grep drbd
>>
>> /dev/drbd1 /dbms-07-02 reiserfs rw 0 0
>>
>> Do you know of any options which might help?
> 
> google for: reiserfs stall
> 
> there repeatedly have been issues.
> 
> since you are using suse 10.2, suse kernel 2.6.18-something,
> have a look at
>  http://linux.wordpress.com/2006/09/27/suse-102-ditching-reiserfs-as-it-default-fs/
> where Jeff Mahoney is cited as
>  
>  ReiserFS has serious performance problems with extended attributes and
>  ACLs. (Yes, this one is my own fault, see numerous flamewars on lkml and
>  reiserfs-list for my opinion on this.) xattrs are backed by normal files
>  rooted in a hidden directory structure. This is bad for performance and
>  in rare cases is deadlock prone due to lock inversions between pdflush
>  and the xattr code.
> 
>  also you could try to find out whether this upstream commit
>   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a3172027148120b8f8797cbecc7d0a0b215736a1
>   [Fix reiserfs latencies caused by data=ordered]
>  is included in your kernel or not.
>  [data=ordered is default, and recommended].
> 
> also, note this is but speculation,
> but it may well be that even the latest "fixes" don't fix the cause (some race
> condition), but make the symptom go away by reducing the likelihood of the race
> to trigger. and then introduce drbd replication, and your broken nics, which
> change the latency and thus io-timing a lot, and suddenly the same (or some different)
> race triggers again?
> 
> is changing file system type an option?
> or changing kernel version?
> would you be able to try this on a test cluster?
> 
> anyways, this starts to become off topic for a _drbd_ list...
> 
