Sebastian,

Thank you for taking the time to read my post and to provide some advice and information about the possible causes. I had come across blktrace but was not sure how to use it effectively, so thanks for the tips.

I have performed some testing on the primary iscsi/drbd node, monitoring I/O on the physical disk (/dev/sdb - RAID 10, 3 TB, configured as 3 partitions and 3 LVM volume groups) and on the DRBD device, running blktrace for each of the following tests:

1. use dd to write 100M directly to the LVM logical volume used as the backing device for the DRBD device (/dev/drbd9)
2. use dd to write 100M to the DRBD resource device (/dev/drbd9) with the DRBD resource disconnected
3. use dd to write 100M to the DRBD resource device (/dev/drbd9) with the DRBD resource connected

I used the following blktrace command:

blktrace /dev/sdb /dev/drbd9 -b 4096

and the following dd command (with of= pointed at the backing logical volume for test 1):

dd if=/dev/zero of=/dev/drbd9 oflag=direct bs=1M count=100

Results Summary

The dd throughput results were as follows:

Direct to LVM logical volume (disconnected - not that that actually matters):
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.133475 s, 786 MB/s

DRBD resource disconnected:
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.134998 s, 777 MB/s

DRBD resource connected:
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.710407 s, 148 MB/s

These results are reproducible.
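For anyone wanting to repeat this, the capture boils down to something like the following (a sketch only - whether blktrace is backgrounded or run from a second shell makes no difference, and the device names are specific to my setup):

# trace the backing disk and the DRBD device while running the direct-IO dd test
blktrace -d /dev/sdb -d /dev/drbd9 -b 4096 &
TRACE_PID=$!
dd if=/dev/zero of=/dev/drbd9 oflag=direct bs=1M count=100
kill -INT $TRACE_PID     # blktrace stops on SIGINT and leaves per-CPU trace files behind
wait $TRACE_PID
blkparse sdb drbd9 > trace.txt    # merge the per-CPU files into one readable trace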
I am not sure how to analyse and present the blktrace/blkparse results, but as a first attempt (actually about the tenth attempt!) I have run:

blkparse sdb drbd9 | egrep '147,9|8,16|8,17' | egrep '\[dd\]|\[0\]|<-' | grep -v '8,19'

I think this shows the key information needed to identify where the delays are occurring:

147,9 is the major,minor of /dev/drbd9
8,16 is the major,minor of /dev/sdb
8,17 is the major,minor of the partition (/dev/sdb1) on the physical disk containing the LVM volume group, logical volume and DRBD resource used in the testing.

I have spent quite some time reviewing the blktrace data and believe I have a reasonable idea of what is going on now. I can provide the data off-list if required. I have shown below what I think are the key parts.

DRBD resource disconnected:

147,9  0     1  21.005266905  4350  Q  WS 0 + 256 [dd]
  8,17 0  6236  21.005272149  4350  A  WS 512 + 256 <-
  8,16 0  6237  21.005272439  4350  A  WS 2560 + 256 <-
  8,16 0  6238  21.005272632  4350  Q  WS 2560 + 256 [dd]
  8,16 0  6239  21.005273752  4350  G  WS 2560 + 256 [dd]
  8,16 0  6240  21.005274750  4350  I  WS 2560 + 256 [dd]
  8,16 0  6241  21.005275586  4350  D  WS 2560 + 256 [dd]
147,9  0     2  21.005287503  4350  Q  WS 256 + 256 [dd]
  8,17 0  6242  21.005289389  4350  A  WS 768 + 256 <-
  8,16 0  6243  21.005289500  4350  A  WS 2816 + 256 <-
  8,16 0  6244  21.005289633  4350  Q  WS 2816 + 256 [dd]
  8,16 0  6245  21.005290137  4350  G  WS 2816 + 256 [dd]
  8,16 0  6246  21.005290647  4350  I  WS 2816 + 256 [dd]
  8,16 0  6247  21.005291023  4350  D  WS 2816 + 256 [dd]
...
147,9  0     8  21.005361065  4350  Q  WS 1792 + 256 [dd]
  8,17 0  6278  21.005362755  4350  A  WS 2304 + 256 <-
  8,16 0  6279  21.005362861  4350  A  WS 4352 + 256 <-
  8,16 0  6280  21.005362990  4350  Q  WS 4352 + 256 [dd]
  8,16 0  6281  21.005363326  4350  G  WS 4352 + 256 [dd]
  8,16 0  6282  21.005363760  4350  I  WS 4352 + 256 [dd]
  8,16 0  6283  21.005364039  4350  D  WS 4352 + 256 [dd]
  8,16 0  6284  21.005511623     0  C  WS 2560 + 256 [0]
  8,16 0  6285  21.005628122     0  C  WS 2816 + 256 [0]
  8,16 0  6286  21.005735922     0  C  WS 3072 + 256 [0]
  8,16 0  6287  21.005885404     0  C  WS 3328 + 256 [0]
  8,16 0  6288  21.005998357     0  C  WS 3584 + 256 [0]
  8,16 0  6289  21.006143985     0  C  WS 3840 + 256 [0]
  8,16 0  6290  21.006287354     0  C  WS 4096 + 256 [0]
  8,16 0  6291  21.006414761     0  C  WS 4352 + 256 [0]
147,9  0     9  21.006599134  4350  Q  WS 2048 + 256 [dd]
...

DRBD resource connected:

147,9  4     1  137.254358367  29404  Q  WS 0 + 256 [dd]
  8,17 4  5858  137.254366442  29404  A  WS 512 + 256 <-
  8,16 4  5859  137.254366919  29404  A  WS 2560 + 256 <-
  8,16 4  5860  137.254367231  29404  Q  WS 2560 + 256 [dd]
  8,16 4  5861  137.254368957  29404  G  WS 2560 + 256 [dd]
  8,16 4  5862  137.254369927  29404  I  WS 2560 + 256 [dd]
  8,16 4  5863  137.254371124  29404  D  WS 2560 + 256 [dd]
147,9  4     2  137.254383956  29404  Q  WS 256 + 256 [dd]
  8,17 4  5864  137.254386137  29404  A  WS 768 + 256 <-
  8,16 4  5865  137.254386280  29404  A  WS 2816 + 256 <-
  8,16 4  5866  137.254386413  29404  Q  WS 2816 + 256 [dd]
  8,16 4  5867  137.254386945  29404  G  WS 2816 + 256 [dd]
  8,16 4  5868  137.254387471  29404  I  WS 2816 + 256 [dd]
  8,16 4  5869  137.254387809  29404  D  WS 2816 + 256 [dd]
...
147,9  4     8  137.254460623  29404  Q  WS 1792 + 256 [dd]
  8,17 4  5900  137.254462917  29404  A  WS 2304 + 256 <-
  8,16 4  5901  137.254463007  29404  A  WS 4352 + 256 <-
  8,16 4  5902  137.254463129  29404  Q  WS 4352 + 256 [dd]
  8,16 4  5903  137.254463480  29404  G  WS 4352 + 256 [dd]
  8,16 4  5904  137.254463927  29404  I  WS 4352 + 256 [dd]
  8,16 4  5905  137.254464193  29404  D  WS 4352 + 256 [dd]
  8,16 0 23502  137.254625838      0  C  WS 2560 + 256 [0]
  8,16 0 23503  137.254783835      0  C  WS 2816 + 256 [0]
  8,16 0 23504  137.254895659      0  C  WS 3072 + 256 [0]
  8,16 0 23505  137.255035266      0  C  WS 3328 + 256 [0]
  8,16 0 23506  137.255174766      0  C  WS 3584 + 256 [0]
  8,16 0 23507  137.255272516      0  C  WS 3840 + 256 [0]
  8,16 0 23508  137.255405754      0  C  WS 4096 + 256 [0]
  8,16 0 23509  137.255536862      0  C  WS 4352 + 256 [0]
147,9  4     9  137.263222933  29404  Q  WS 2048 + 256 [dd]
...

Analysis:

1. In both the connected and disconnected cases the write sizes used by DRBD are 128 KiB, so I think this is OK. At least they are not 4 KiB, which would indicate the IO limits problem you mention.

2. In the connected case there is a significant delay after the last IO completion action (C) before the next dd IO occurs: 0.007686071 seconds connected vs 0.000184373 seconds disconnected.

This behaviour is reproducible and is seen after every sequence of IO completion actions for the entire run of dd. From my understanding of the blktrace data, the IO completion action occurs every 1 MB.
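The completion-to-next-submit gaps quoted in point 2 can be pulled out of the blkparse output with something along these lines (a rough sketch only - it assumes the default blkparse column layout shown in the excerpts above and my major,minor numbers):

# print the delay between the last completion on sdb and the next dd submit on drbd9
blkparse sdb drbd9 | awk '
    $1 == "8,16"  && $6 == "C"                              { last_c = $4 }
    $1 == "147,9" && $6 == "Q" && $NF == "[dd]" && last_c   { printf "%.9f\n", $4 - last_c; last_c = 0 }
'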
My hunch is that this delay is caused by network latency, i.e. transferring the data to the network layer and then across to the secondary DRBD node via the bonded interface. Also, given Protocol C is being used, I understand IO blocking occurs until the data is actually written to disk (or to the controller BBU cache), so perhaps the secondary is responsible in this case?

If my analysis is correct then DRBD itself is not responsible for the delays. Does this sound sensible and reasonable? If so, how can I confirm it? Is there a tool like blktrace for use on network IO?

Given Ubuntu 12.04 LTS is using kernel 3.2.0-XX, is it worth disabling the intel_idle cpuidle driver, as described on this list by Shaun Thomas, to decrease latency?

http://www.engardelinux.org/modules/index/list_archives.cgi?list=drbd-user&page=0121.html&month=2012-10
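If I have read that post correctly, the change amounts to something like the following boot parameters in /etc/default/grub (a sketch only - I have not tried it, and the exact parameters Shaun used may differ):

# disable the intel_idle cpuidle driver and limit ACPI C-states (untested here)
GRUB_CMDLINE_LINUX_DEFAULT="intel_idle.max_cstate=0 processor.max_cstate=1"
# then run update-grub and reboot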
Comments and suggestions welcome.

Regards

Paul

> -----Original Message-----
> From: Sebastian Riemer [mailto:sebastian.riemer at profitbricks.com]
> Sent: Tuesday, 8 January 2013 10:24 PM
> To: Paul Freeman
> Cc: drbd-user at lists.linbit.com; Sebastian Riemer
> Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> sustained writes and replication enabled to secondary
>
> On 07.01.2013 22:51, Paul Freeman wrote:
> > I have come across a few references of similar behaviour on the net but
> > have not come across the solution(s) which appear relevant in my
> > situation.
> >
> > Any comments and suggestions would be welcome.
>
> The first thing to do is a "blktrace". I've already described how to do
> this on this list.
>
> The interesting bit is how big your IOs are for regular data as well as
> for syncer IO. Try to do as big as possible IOs (e.g. 1 MiB) with direct
> IO for this.
>
> If IO sizes are O.K. (>= 128 KiB), but latency between IOs is extremely
> high, then you know that something slows down the IOs.
> Otherwise, it is possible that you hit the IO limits bug in DRBD < 8.3.14:
>
> http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commit;h=3a2911f0c7ab80d061a5a6878c95f6d731c98554
>
> Repeat the blktracing connected and disconnected.
>
> If you see 4 KiB IOs while connected, then it is an/the IO limits bug.
>
> The dumb thing is that DRBD has a dynamic IO limits detection which
> resets the IO limits to 4 KiB if you loose connection to the secondary.
> This is why people need a caching RAID controller which merges and fixes
> such crappy behavior and can't use an HBA with SW RAID.
>
> Also with a really fast transport like QDR InfiniBand (40 Gbit/s) you'll
> see that DRBD introduces lots of latency. Can't get more than approx.
> 250 MiB/s although the storage server can do > 600 MiB/s and the
> transport can do 4 GiB/s at < 1 us latency.
>
> This is why we enhanced MD RAID-1 for replication. MD is much more
> stable, gives up to 512 KiB IOs, has a very intelligent write-intent
> bitmap and is high performance replication. As transport we've got SRP
> which is RDMA only. With replication separated from the transport we can
> also do better blktracing. And we have the same latency for both RAID-1
> disks as we have them both on SRP remote storage.
>
> Cheers,
> Sebastian
>
> --
> Sebastian Riemer
> Linux Kernel Developer - Storage
>
> We are looking for (SENIOR) LINUX KERNEL DEVELOPERS!
>
> ProfitBricks GmbH * Greifswalder Str. 207 * 10405 Berlin, Germany
> www.profitbricks.com * sebastian.riemer at profitbricks.com