Sebastian,

Thank you for taking the time to read my post and to provide some advice and information about the possible causes. I had come across blktrace but was not sure how to use it effectively, so thanks for the tips.

I have performed some testing on the primary iscsi/drbd node, monitoring I/O on the physical disk (/dev/sdb - RAID 10, 3 TB, configured as 3 partitions and 3 LVM volume groups) and on the DRBD device, running blktrace for each of the following tests:

1. use dd to write 100M directly to the LVM logical volume used as the backing device for the DRBD device (/dev/drbd9)
2. use dd to write 100M to the DRBD resource device (/dev/drbd9) with the DRBD resource disconnected
3. use dd to write 100M to the DRBD resource device (/dev/drbd9) with the DRBD resource connected

I used the following blktrace command:

blktrace /dev/sdb /dev/drbd9 -b 4096

and the following dd command (with of= pointed at the backing logical volume for test 1):

dd if=/dev/zero of=/dev/drbd9 oflag=direct bs=1M count=100

Results Summary

The dd throughput results were as follows:

Direct to LVM logical volume (disconnected - not that that actually matters):
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.133475 s, 786 MB/s

DRBD resource disconnected:
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.134998 s, 777 MB/s

DRBD resource connected:
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.710407 s, 148 MB/s

These results are reproducible.
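For anyone wanting to repeat this, the capture boils down to something like the following (a sketch only - whether blktrace is backgrounded or run from a second shell makes no difference, and the device names are specific to my setup):

# trace the backing disk and the DRBD device while running the direct-IO dd test
blktrace -d /dev/sdb -d /dev/drbd9 -b 4096 &
TRACE_PID=$!
dd if=/dev/zero of=/dev/drbd9 oflag=direct bs=1M count=100
kill -INT $TRACE_PID     # blktrace stops on SIGINT and leaves per-CPU trace files behind
wait $TRACE_PID
blkparse sdb drbd9 > trace.txt    # merge the per-CPU files into one readable trace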
I am not sure how to analyse and present the blktrace/blkparse results, but as a first attempt (actually about the tenth attempt!) I have run:

blkparse sdb drbd9 | egrep '147,9|8,16|8,17' | egrep '\[dd\]|\[0\]|<-' | grep -v '8,19'

I think this shows the key information needed to identify where the delays are occurring:

147,9 is the major,minor of /dev/drbd9
8,16 is the major,minor of /dev/sdb
8,17 is the major,minor of the partition (/dev/sdb1) on the physical disk containing the LVM volume group, logical volume and DRBD resource used in the testing.

I have spent quite some time reviewing the blktrace data and believe I have a reasonable idea of what is going on now. I can provide the data off-list if required. I have shown below what I think are the key parts.

DRBD resource disconnected:

147,9  0     1  21.005266905  4350  Q  WS 0 + 256 [dd]
  8,17 0  6236  21.005272149  4350  A  WS 512 + 256 <-
  8,16 0  6237  21.005272439  4350  A  WS 2560 + 256 <-
  8,16 0  6238  21.005272632  4350  Q  WS 2560 + 256 [dd]
  8,16 0  6239  21.005273752  4350  G  WS 2560 + 256 [dd]
  8,16 0  6240  21.005274750  4350  I  WS 2560 + 256 [dd]
  8,16 0  6241  21.005275586  4350  D  WS 2560 + 256 [dd]
147,9  0     2  21.005287503  4350  Q  WS 256 + 256 [dd]
  8,17 0  6242  21.005289389  4350  A  WS 768 + 256 <-
  8,16 0  6243  21.005289500  4350  A  WS 2816 + 256 <-
  8,16 0  6244  21.005289633  4350  Q  WS 2816 + 256 [dd]
  8,16 0  6245  21.005290137  4350  G  WS 2816 + 256 [dd]
  8,16 0  6246  21.005290647  4350  I  WS 2816 + 256 [dd]
  8,16 0  6247  21.005291023  4350  D  WS 2816 + 256 [dd]
...
147,9  0     8  21.005361065  4350  Q  WS 1792 + 256 [dd]
  8,17 0  6278  21.005362755  4350  A  WS 2304 + 256 <-
  8,16 0  6279  21.005362861  4350  A  WS 4352 + 256 <-
  8,16 0  6280  21.005362990  4350  Q  WS 4352 + 256 [dd]
  8,16 0  6281  21.005363326  4350  G  WS 4352 + 256 [dd]
  8,16 0  6282  21.005363760  4350  I  WS 4352 + 256 [dd]
  8,16 0  6283  21.005364039  4350  D  WS 4352 + 256 [dd]
  8,16 0  6284  21.005511623     0  C  WS 2560 + 256 [0]
  8,16 0  6285  21.005628122     0  C  WS 2816 + 256 [0]
  8,16 0  6286  21.005735922     0  C  WS 3072 + 256 [0]
  8,16 0  6287  21.005885404     0  C  WS 3328 + 256 [0]
  8,16 0  6288  21.005998357     0  C  WS 3584 + 256 [0]
  8,16 0  6289  21.006143985     0  C  WS 3840 + 256 [0]
  8,16 0  6290  21.006287354     0  C  WS 4096 + 256 [0]
  8,16 0  6291  21.006414761     0  C  WS 4352 + 256 [0]
147,9  0     9  21.006599134  4350  Q  WS 2048 + 256 [dd]
...

DRBD resource connected:

147,9  4     1  137.254358367  29404  Q  WS 0 + 256 [dd]
  8,17 4  5858  137.254366442  29404  A  WS 512 + 256 <-
  8,16 4  5859  137.254366919  29404  A  WS 2560 + 256 <-
  8,16 4  5860  137.254367231  29404  Q  WS 2560 + 256 [dd]
  8,16 4  5861  137.254368957  29404  G  WS 2560 + 256 [dd]
  8,16 4  5862  137.254369927  29404  I  WS 2560 + 256 [dd]
  8,16 4  5863  137.254371124  29404  D  WS 2560 + 256 [dd]
147,9  4     2  137.254383956  29404  Q  WS 256 + 256 [dd]
  8,17 4  5864  137.254386137  29404  A  WS 768 + 256 <-
  8,16 4  5865  137.254386280  29404  A  WS 2816 + 256 <-
  8,16 4  5866  137.254386413  29404  Q  WS 2816 + 256 [dd]
  8,16 4  5867  137.254386945  29404  G  WS 2816 + 256 [dd]
  8,16 4  5868  137.254387471  29404  I  WS 2816 + 256 [dd]
  8,16 4  5869  137.254387809  29404  D  WS 2816 + 256 [dd]
...
147,9  4     8  137.254460623  29404  Q  WS 1792 + 256 [dd]
  8,17 4  5900  137.254462917  29404  A  WS 2304 + 256 <-
  8,16 4  5901  137.254463007  29404  A  WS 4352 + 256 <-
  8,16 4  5902  137.254463129  29404  Q  WS 4352 + 256 [dd]
  8,16 4  5903  137.254463480  29404  G  WS 4352 + 256 [dd]
  8,16 4  5904  137.254463927  29404  I  WS 4352 + 256 [dd]
  8,16 4  5905  137.254464193  29404  D  WS 4352 + 256 [dd]
  8,16 0 23502  137.254625838      0  C  WS 2560 + 256 [0]
  8,16 0 23503  137.254783835      0  C  WS 2816 + 256 [0]
  8,16 0 23504  137.254895659      0  C  WS 3072 + 256 [0]
  8,16 0 23505  137.255035266      0  C  WS 3328 + 256 [0]
  8,16 0 23506  137.255174766      0  C  WS 3584 + 256 [0]
  8,16 0 23507  137.255272516      0  C  WS 3840 + 256 [0]
  8,16 0 23508  137.255405754      0  C  WS 4096 + 256 [0]
  8,16 0 23509  137.255536862      0  C  WS 4352 + 256 [0]
147,9  4     9  137.263222933  29404  Q  WS 2048 + 256 [dd]
...

Analysis:

1. In both the connected and disconnected cases the write sizes used by DRBD are 128 KiB, so I think this is OK. At least they are not 4 KiB, which would indicate the IO limits problem you mention.

2. In the connected case there is a significant delay after the last IO completion action (C) before the next dd IO occurs: 0.007686071 seconds connected vs 0.000184373 seconds disconnected.

This behaviour is reproducible and is seen after every sequence of IO completion actions for the entire run of dd. From my understanding of the blktrace data, the IO completion action occurs every 1 MB.
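The completion-to-next-submit gaps quoted in point 2 can be pulled out of the blkparse output with something along these lines (a rough sketch only - it assumes the default blkparse column layout shown in the excerpts above and my major,minor numbers):

# print the delay between the last completion on sdb and the next dd submit on drbd9
blkparse sdb drbd9 | awk '
    $1 == "8,16"  && $6 == "C"                              { last_c = $4 }
    $1 == "147,9" && $6 == "Q" && $NF == "[dd]" && last_c   { printf "%.9f\n", $4 - last_c; last_c = 0 }
'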
My hunch is that this delay is caused by network latency, i.e. transferring the data to the network layer and then across to the secondary DRBD node via the bonded interface. Also, given Protocol C is being used, I understand IO blocking occurs until the data is actually written to disk (or to the controller BBU cache), so perhaps the secondary is responsible in this case?

If my analysis is correct then DRBD itself is not responsible for the delays. Does this sound sensible and reasonable? If so, how can I confirm it? Is there a tool like blktrace for use on network IO?

Given Ubuntu 12.04 LTS is using kernel 3.2.0-XX, is it worth disabling the intel_idle cpuidle driver, as described on this list by Shaun Thomas, to decrease latency?

http://www.engardelinux.org/modules/index/list_archives.cgi?list=drbd-user&page=0121.html&month=2012-10
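If I have read that post correctly, the change amounts to something like the following boot parameters in /etc/default/grub (a sketch only - I have not tried it, and the exact parameters Shaun used may differ):

# disable the intel_idle cpuidle driver and limit ACPI C-states (untested here)
GRUB_CMDLINE_LINUX_DEFAULT="intel_idle.max_cstate=0 processor.max_cstate=1"
# then run update-grub and reboot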
Comments and suggestions welcome.

Regards

Paul

> -----Original Message-----
> From: Sebastian Riemer [mailto:sebastian.riemer at profitbricks.com]
> Sent: Tuesday, 8 January 2013 10:24 PM
> To: Paul Freeman
> Cc: drbd-user at lists.linbit.com; Sebastian Riemer
> Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> sustained writes and replication enabled to secondary
>
> On 07.01.2013 22:51, Paul Freeman wrote:
> > I have come across a few references of similar behaviour on the net but
> > have not come across the solution(s) which appear relevant in my
> > situation.
> >
> > Any comments and suggestions would be welcome.
>
> The first thing to do is a "blktrace". I've already described how to do
> this on this list.
>
> The interesting bit is how big your IOs are for regular data as well as
> for syncer IO. Try to do as big as possible IOs (e.g. 1 MiB) with direct
> IO for this.
>
> If IO sizes are O.K. (>= 128 KiB), but latency between IOs is extremely
> high, then you know that something slows down the IOs.
> Otherwise, it is possible that you hit the IO limits bug in DRBD < 8.3.14:
>
> http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commit;h=3a2911f0c7ab80d061a5a6878c95f6d731c98554
>
> Repeat the blktracing connected and disconnected.
>
> If you see 4 KiB IOs while connected, then it is an/the IO limits bug.
>
> The dumb thing is that DRBD has a dynamic IO limits detection which
> resets the IO limits to 4 KiB if you loose connection to the secondary.
> This is why people need a caching RAID controller which merges and fixes
> such crappy behavior and can't use an HBA with SW RAID.
>
> Also with a really fast transport like QDR InfiniBand (40 Gbit/s) you'll
> see that DRBD introduces lots of latency. Can't get more than approx.
> 250 MiB/s although the storage server can do > 600 MiB/s and the
> transport can do 4 GiB/s at < 1 us latency.
>
> This is why we enhanced MD RAID-1 for replication. MD is much more
> stable, gives up to 512 KiB IOs, has a very intelligent write-intent
> bitmap and is high performance replication. As transport we've got SRP
> which is RDMA only. With replication separated from the transport we can
> also do better blktracing. And we have the same latency for both RAID-1
> disks as we have them both on SRP remote storage.
>
> Cheers,
> Sebastian
>
> --
> Sebastian Riemer
> Linux Kernel Developer - Storage
>
> We are looking for (SENIOR) LINUX KERNEL DEVELOPERS!
>
> ProfitBricks GmbH * Greifswalder Str. 207 * 10405 Berlin, Germany
> www.profitbricks.com * sebastian.riemer at profitbricks.com