[DRBD-user] DRBD serious locking due to TOE - UPDATE

Tue Jan 8 15:57:43 CET 2008

On Tue, Jan 08, 2008 at 01:09:36PM +0000, Ben Clewett wrote:
> 
> 
> Hi Lars,
> 
> Lars Ellenberg wrote:
> > grrr. strip off one of the nots, please,
> > either "not likely" or "likely not".
> > anyways, it is NOT a problem in drbd.
> > but you knew what I meant, right?
> 
> Many thanks, lots of good information to work with.
> 
> I knew what you meant.  I don't believe it is a problem with DRBD.  But 
> it manifests through DRBD.  There is a known problem with my NIC, even at 
> the latest driver:
> 
> http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1199792167647+28353475&threadId=1186722
> 
> However the suggested fix here doesn't make a difference:
> 
> # ethtool -K eth2 tso off
> 
> I have a known problem which does not respond to the known fix!
> 
> But I still suspect the NICs...  Tonight I am going to move them all 
> around to see if the problem follows the card.
> 
> --------
> 
> I also know that if I reboot the servers, I'll get a few days of grace 
> before the locking returns.  I don't think that is DRBD related?
> 
> The other thing I note is that slowing data input by introducing small 1 
> second gaps every half minute or so actually increases the rate at 
> which data is accepted.  What could cause that effect?

don't start me speculating on that :)

> --------
> 
> The analysis of /proc/drbd shows data to be moving.  But I can see a 
> pattern:
> 
> With my symmetrical system, on both servers:
> # watch -n 0.1 -d cat /proc/drbd
> 
> The DRBD resource which is not 'locked' is moving data continuously. 
> lo:1 is the normal condition, ns and dw are increasing linearly.
> 
> However the DRBD disk which is locked is only sporadically moving data. 
> It can spend up to ~2 seconds stuck on:
> 	Primary:	bm:0 lo:n pe:0 ua:0 ap:n
> 	Secondary:	bm:0 lo:0 pe:0 ua:0 ap:0
> Where n is some number between 1 and ~ 50.

in that case, it is the _local io subsystem_ on the Primary,
that does not return the requests.
        ap: application bios coming from upper layers (file system),
            but not yet reported as completed to upper layers
        lo: submitted to local disk, but not completed from there yet.
        pe: sent to peer, but not yet acknowledged (pending peer)

we can only report completed to upper layers (decrease ap) when
both corresponding lo and pe requests have completed.  since your pe is
0, but ap and lo are n, your local disk does not respond (quickly enough).
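
the reasoning above can be written down as a small check. this is only a
sketch: it assumes the counters appear as name:value pairs on the /proc/drbd
status line (as in the lines quoted above), and the function names and the
sample line are made up for illustration.

```python
import re

# Counter names as they appear on a /proc/drbd status line (layout assumed).
COUNTER_RE = re.compile(r"\b(ns|nr|dw|dr|al|bm|lo|pe|ua|ap):(\d+)")


def parse_counters(line):
    """Return a dict of the numeric counters found in one status line."""
    return {name: int(value) for name, value in COUNTER_RE.findall(line)}


def diagnose(counters):
    """Guess what is holding up completion, following the ap/lo/pe logic."""
    ap = counters.get("ap", 0)  # application requests not yet completed
    lo = counters.get("lo", 0)  # submitted to local disk, not yet completed
    pe = counters.get("pe", 0)  # sent to peer, not yet acknowledged
    if ap == 0:
        return "idle: no application requests outstanding"
    if lo > 0 and pe == 0:
        return "local disk: peer has acknowledged, local io has not completed"
    if pe > 0 and lo == 0:
        return "peer/network: local io is done, waiting for acknowledgement"
    if lo > 0 and pe > 0:
        return "both: waiting on local disk and on peer"
    return "unclear: ap is pending but neither lo nor pe is"


if __name__ == "__main__":
    # Hypothetical sample in the shape of the stuck Primary line, with n = 17.
    sample = "bm:0 lo:17 pe:0 ua:0 ap:17"
    print(diagnose(parse_counters(sample)))
```

fed with the stuck Primary line from above (pe:0, ap and lo both n), this
reports the local disk as the side that has not completed yet.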

> It looks like the traffic flow from one server is blocking the traffic 
> flow from the other, like it has a higher priority, or is being 
> traffic shaped.  This might also explain why introducing 1 second gaps 
> helps get data moving: it stops one server, letting the other work.  But 
> it might also be coincidence...
> 
> Can you think of anything in the kernel which might effect this?

the file system probably waits for these n requests to complete before it
sends more, maybe a write-after-write dependency on some journal commit.

> ----
> 
> Another idea, in case my resources are clashing for bandwidth:  I 
> have two resources sharing a gigabit connection.  My rate = 100M.  If I 
> set this to 50M, would this ensure that each resource could not use more 
> than 50% of the connection?

the sync rate is only relevant during _resynchronization_,
and is a throttle for the resync process to leave bandwidth for the
normal, ongoing, _replication_ process.
as long as drbd is cs:Connected, the "syncer rate" is irrelevant.
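
to make that concrete, a sketch that reads the cs: field and tells whether
the resync throttle is in play at all. assumptions: the state appears as
cs:<Name> on the status line, and SyncSource/SyncTarget are the states in
which a resync is running; the function names and sample lines are made up
for illustration.

```python
import re

# Connection states in which a resync is actually running (assumed set).
RESYNC_STATES = {"SyncSource", "SyncTarget"}


def connection_state(status_line):
    """Extract the value of the cs: field, or None if it is not present."""
    match = re.search(r"\bcs:(\w+)", status_line)
    return match.group(1) if match else None


def syncer_rate_applies(status_line):
    """True only while a resync is running; replication is not throttled."""
    return connection_state(status_line) in RESYNC_STATES


if __name__ == "__main__":
    # Hypothetical status lines, not taken from a real system.
    for line in (" 1: cs:Connected st:Primary/Secondary",
                 " 1: cs:SyncSource st:Primary/Secondary"):
        print(connection_state(line), syncer_rate_applies(line))
```

so lowering the rate to 50M would not cap what each resource may use during
normal replication; it only changes how fast a resync is allowed to run.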

> ----
> 
> You also asked about mounting:
> 
> # cat /proc/mounts | grep drbd
> 
> /dev/drbd1 /dbms-07-02 reiserfs rw 0 0
> 
> Do you know of any options which might help?

google for: reiserfs stall

there repeatedly have been issues.

since you are using suse 10.2, suse kernel 2.6.18-something,
have a look at
 http://linux.wordpress.com/2006/09/27/suse-102-ditching-reiserfs-as-it-default-fs/
where Jeff Mahoney is quoted as saying

 ReiserFS has serious performance problems with extended attributes and
 ACLs. (Yes, this one is my own fault, see numerous flamewars on lkml and
 reiserfs-list for my opinion on this.) xattrs are backed by normal files
 rooted in a hidden directory structure. This is bad for performance and
 in rare cases is deadlock prone due to lock inversions between pdflush
 and the xattr code.

 also you could try to find out whether this upstream commit
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a3172027148120b8f8797cbecc7d0a0b215736a1
  [Fix reiserfs latencies caused by data=ordered]
 is included in your kernel or not.
 [data=ordered is default, and recommended].
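
 one way to check, sketched under assumptions: you have a git checkout of
 the kernel source that matches what you run, and git is installed. the
 helper name and its parameters are made up for illustration. a vendor
 kernel usually carries such fixes as backported patches, in which case the
 upstream hash will not show up and the package changelog is the place to
 look instead.

```python
import subprocess

# Commit hash taken from the URL above.
COMMIT = "a3172027148120b8f8797cbecc7d0a0b215736a1"


def commit_is_included(commit, ref="HEAD", repo=".", run=subprocess.run):
    """True if `commit` is an ancestor of `ref` in the git checkout at `repo`.

    `run` is injectable so the logic can be exercised without a real checkout.
    """
    result = run(
        ["git", "merge-base", "--is-ancestor", commit, ref],
        cwd=repo,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode == 0:
        return True   # commit is contained in ref
    if result.returncode == 1:
        return False  # commit exists but is not an ancestor of ref
    # Any other exit status means git could not answer the question.
    raise RuntimeError("git could not answer (unknown commit or not a repository?)")
```

 usage would be something like commit_is_included(COMMIT, repo="/path/to/linux"),
 where the path is a placeholder for wherever your kernel tree lives.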

also, note this is but speculation,
but it may well be that even the latest "fixes" don't fix the cause (some race
condition), but make the symptom go away by reducing the likelihood of the race
triggering. and then introduce drbd replication, and your broken nics, which
change the latency and thus io-timing a lot, and suddenly the same (or some different)
race triggers again?

is changing file system type an option?
or changing kernel version?
would you be able to try this on a test cluster?

anyways, this starts to become off topic for a _drbd_ list...

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please use the "List-Reply" function of your email client.