Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Andy,
I don't think the network bandwidth between the two iscsi/drbd nodes is the limiting factor. I have changed to a balanced round-robin bond instead of active-passive, and testing with iperf gives about 1.8 Gigabits/sec. However, the high IO wait remains the same as with the bond in active-passive mode. I have the iscsi peer connected via a 1 Gbps active-passive bond to the LAN (and thus to the iscsi/drbd nodes). The highest speed transfer I have seen during a restore (Proxmox qmrestore) to the primary iscsi/drbd node in the disconnected state is about 110 Megabits/s. Thanks for the suggestion though - I appreciate it.

I tried lowering the unplug-watermark from the default (128) to 16. This seemed to make a difference in the first test: the IO wait did reach 99% but it then dropped back before climbing again and dropping back, and this pattern repeated. However, when I tested it a second time the iscsi connection timed out a couple of minutes into the restore and the iscsi target was disconnected on that Proxmox node :-( It was necessary to delete that node's iscsi session to clear the problem.

Thanks again

Paul
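For reference, the two changes described above look roughly like this - a sketch only, where the peer address, resource name and config layout are illustrative, assuming a DRBD 8.3-style config in which unplug-watermark is a net-section option:

  # check raw TCP throughput over the inter-node bond
  #   on the secondary:  iperf -s
  #   on the primary:    iperf -c <peer-ip> -t 30

  # /etc/drbd.d/r0.res - lower the unplug watermark from its default of 128
  resource r0 {
    net {
      unplug-watermark 16;
    }
  }

  # apply the change without a full restart
  drbdadm adjust r0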
> -----Original Message-----
> From: Andy Dills [mailto:andy at xecu.net]
> Sent: Wednesday, 16 January 2013 3:15 AM
> To: Paul Freeman
> Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> sustained writes and replication enabled to secondary
>
> Paul,
>
> Is it possible you're being constrained by the bandwidth connecting your
> two drbd servers?
>
> As I learned in my thread, writes to the local volume are constrained by
> the bandwidth between it and the connected volume...basically it sounds
> like 10gbs is the only way to go.
>
> Not sure it totally explains the iowait situation, but it's worth
> experimenting with...for example I changed my bond method to balance-rr
> and was able to get over 200MBps on dd writes. I'm curious if you
> increased the bandwidth between the drbd nodes even by that much, if it
> would change your results.
>
> Andy
>
> On Tue, 15 Jan 2013, Paul Freeman wrote:
>
> > Sorry for replying to my own email but my initial thoughts that I had
> > resolved the high IO wait issue have proved to be wrong.
> >
> > It seems the testing I did on a test iscsi target created on the
> > iscsi/drbd servers does not perform the same as a production iscsi
> > target. Not sure why at this stage, apart from the fact that it is
> > giving me best case results as nothing else is using the target
> > compared with the production ones.
> >
> > Back to testing.
> >
> > Apologies for the mis-information.
> >
> > Regards
> >
> > Paul
> >
> > > -----Original Message-----
> > > From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-
> > > bounces at lists.linbit.com] On Behalf Of Paul Freeman
> > > Sent: Tuesday, 15 January 2013 12:54 PM
> > > To: Sebastian Riemer
> > > Cc: drbd-user at lists.linbit.com
> > > Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> > > sustained writes and replication enabled to secondary
> > >
> > > Sebastian,
> > > Thanks for your analyses - I really appreciate it. I have taken a
> > > little while to reply as I wanted to complete a dd from the iscsi peer
> > > to the primary iscsi node and run blktrace on the iscsi peer.
> > >
> > > My comments are inline.
> > >
> > > Regards
> > >
> > > Paul
> > >
> > > > -----Original Message-----
> > > > From: Sebastian Riemer [mailto:sebastian.riemer at profitbricks.com]
> > > > Sent: Friday, 11 January 2013 10:08 PM
> > > > To: Paul Freeman
> > > > Cc: drbd-user at lists.linbit.com
> > > > Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> > > > sustained writes and replication enabled to secondary
> > > >
> > > > On 11.01.2013 04:36, Paul Freeman wrote:
> > > > > Sebastian,
> > > > > I have collected some tcpdump logs on the internode link used for
> > > > > drbd while writing 100M to the primary drbd node using dd (as
> > > > > before) and with the relevant drbd resource connected.
> > > > >
> > > > > I am not very proficient at interpreting the tcpdump traffic so
> > > > > please bear with me :-)
> > > > >
> > > > > I ran tcpdump with the options "-vvv -s0 tcp port 7798". Port 7798
> > > > > is the port used by this drbd resource. I subsequently used
> > > > > wireshark to identify the tcp stream associated with the
> > > > > replication traffic and set an appropriate filter to extract those
> > > > > packets and save them to a separate packet capture file.
> > > > >
> > > > > My first guess is the segment ACK RTT may be relevant to our
> > > > > investigation. When I extract the segment ACKs from the secondary
> > > > > node using tshark and tcp.analysis.ack_rtt then process them in
> > > > > MS Excel, the results are as follows:
> > > > >
> > > > > Number of segment ACKs: 5255
> > > > > Average RTT: 1.87ms
> > > > > Minimum RTT: 0.078ms
> > > > > Maximum RTT: 41.45ms (actually the last ACK)
> > > > >
> > > > > The time for dd to complete was 950ms.
> > > > >
> > > > > Does this provide any information on the latency problem?
> > > >
> > > > Better would be writing locally on the primary. So that we only see
> > > > the network latency of the inter-node connection and not of the
> > > > iSCSI connection as well.
> > > >
> > >
> > > The numbers I provided were for writing to the iscsitarget from the
> > > iscsitarget server itself so they exclude the iscsi network latency.
> > >
> > > > I had a deeper look at your blktrace output.
> > > >
> > > > 1 MB from queuing the first 128 KiB chunk to queuing the 9th 128 KiB
> > > > chunk on the DRBD device lasts:
> > > >
> > > > discon    connec
> > > > 1.33 ms   8.86 ms
> > > >
> > > > Difference: 7.53 ms.
> > > >
> > > > Also the latency between queuing (Q) the first IO on the DRBD device
> > > > and dispatching (D) it on the SCSI disk below changed. This is the
> > > > change of DRBD latency.
> > > >
> > > > discon    connec
> > > > 8.7 us    12.8 us
> > > >
> > > > 4.1 us * 8 = 32.8 us. This is negligible.
> > > >
> > > > So the time the packets need in the network layer should be
> > > > something < 7.5 ms until 1 MB is transferred. This is the additional
> > > > time which hurts you when you are connected with DRBD.
> > > >
> > > > Please do a blktrace on the iSCSI client for disconnected and
> > > > connected DRBD primary. Then, we'll see which of the two chained
> > > > network connections is faster.
> > > >
> > > > This is what we know for 1 MB, now:
> > > > iSCSI path:          ? ms
> > > > SCSI/disk:           1.33 ms
> > > > between DRBD nodes:  7.53 ms
> > > >
> > > > Overall latency = iSCSI path + SCSI/disk + between DRBD nodes.
> > > >
> > > > Cheers,
> > > > Sebastian
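Per-request gaps such as the Q-to-D latencies quoted above can be pulled out of a blkparse text dump with something along these lines ("trace.txt" and the 8,64 device number are placeholders for the dump file and the backing disk):

  # queue-to-dispatch latency per request on the backing disk; in blkparse's
  # default output the action code is field 6 and the start sector field 8
  awk '$1=="8,64" && $6=="Q" { q[$8]=$4 }
       $1=="8,64" && $6=="D" && ($8 in q) { printf "%s  %.1f us\n", $8, ($4-q[$8])*1e6 }' trace.txt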
> > >
> > > I have performed some more testing using dd and blktrace from the
> > > iscsi peer, which is connected to the iscsi target (drbd primary node)
> > > (same target as in previous tests) via the LAN using a bonded network
> > > connection of 2 x 1 Gbps adaptors in an active-passive bond, so I
> > > would expect max throughput of approx. 800-900 Mbps.
> > >
> > > The 2 iscsi/drbd nodes are at present connected by a similar network
> > > bonding arrangement, so again TCP network throughput will be approx.
> > > 800-900 Mbps.
> > >
> > > I tested with the drbd resource connected and disconnected. The raw dd
> > > results were as follows.
> > >
> > > Resource connected:
> > > root at sur-vm-02:~# dd if=/dev/zero of=/dev/backup_01/vm-100-disk-1 oflag=direct bs=1M count=100
> > > 100+0 records in
> > > 100+0 records out
> > > 104857600 bytes (105 MB) copied, 0.983008 s, 107 MB/s
> > >
> > > Resource disconnected:
> > > root at sur-vm-02:~# dd if=/dev/zero of=/dev/backup_01/vm-100-disk-1 oflag=direct bs=1M count=100
> > > 100+0 records in
> > > 100+0 records out
> > > 104857600 bytes (105 MB) copied, 0.980041 s, 107 MB/s
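A typical way to capture traces like the ones below on the iSCSI client would be something like the following (device paths, run time and output names are purely illustrative; 253,12 in the dumps is the dm device for the LV and 8,64 the SCSI disk the iSCSI LUN appears as):

  # trace the LV and the underlying iSCSI disk while dd runs, then turn the
  # binary traces into the text form shown below
  blktrace -w 30 -d /dev/backup_01/vm-100-disk-1 -o dm_lv &
  blktrace -w 30 -d /dev/sde -o sde &
  wait
  blkparse dm_lv sde > trace-connected.txt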
> > >
> > > The blktrace data for the resource connected state is presented in a
> > > similar manner to my earlier email.
> > >
> > > dd and blktrace with resource connected:
> > >
> > > 253,12  6  1   0.000000000  949945  Q  WS  0 + 1024 [dd]
> > > 8,64    6  1   0.000010264  949945  A  WS  2048 + 1024 <- (253,12) 0
> > > 8,64    6  2   0.000012718  949945  Q  WS  2048 + 1024 [dd]
> > > 8,64    6  3   0.000017482  949945  G  WS  2048 + 1024 [dd]
> > > 8,64    6  4   0.000020597  949945  P  N [dd]
> > > 8,64    6  5   0.000021889  949945  I  WS  2048 + 1024 [dd]
> > > 253,12  6  2   0.000053322  949945  Q  WS  1024 + 1024 [dd]
> > > 8,64    6  6   0.000054656  949945  A  WS  3072 + 1024 <- (253,12) 1024
> > > 8,64    6  7   0.000055258  949945  Q  WS  3072 + 1024 [dd]
> > > 8,64    6  8   0.000056680  949945  G  WS  3072 + 1024 [dd]
> > > 8,64    6  9   0.000057596  949945  I  WS  3072 + 1024 [dd]
> > > 253,12  6  3   0.000059072  949945  U  N [dd] 0
> > > 8,64    6  10  0.000059508  949945  U  N [dd] 2
> > > 8,64    6  11  0.000062659  949945  D  WS  2048 + 1024 [dd]
> > > 8,64    6  12  0.000083516  949945  D  WS  3072 + 1024 [dd]
> > > 8,64    4  1   0.008451596       0  C  WS  2048 + 1024 [0]
> > > 253,12  4  1   0.008455216       0  C  WS  0 + 1024 [0]
> > > 8,64    4  2   0.010099424       0  C  WS  3072 + 1024 [0]
> > > 253,12  4  2   0.010101816       0  C  WS  1024 + 1024 [0]
> > > 253,12  4  3   0.010343363  949945  Q  WS  2048 + 1024 [dd]
> > > 8,64    4  3   0.010345313  949945  A  WS  4096 + 1024 <- (253,12) 2048
> > > 8,64    4  4   0.010346379  949945  Q  WS  4096 + 1024 [dd]
> > > 8,64    4  5   0.010347427  949945  G  WS  4096 + 1024 [dd]
> > > 8,64    4  6   0.010349276  949945  P  N [dd]
> > > 8,64    4  7   0.010349774  949945  I  WS  4096 + 1024 [dd]
> > > 253,12  4  4   0.010379249  949945  Q  WS  3072 + 1024 [dd]
> > > 8,64    4  8   0.010380159  949945  A  WS  5120 + 1024 <- (253,12) 3072
> > > 8,64    4  9   0.010380787  949945  Q  WS  5120 + 1024 [dd]
> > > 8,64    4  10  0.010381797  949945  G  WS  5120 + 1024 [dd]
> > > 8,64    4  11  0.010382719  949945  I  WS  5120 + 1024 [dd]
> > > 253,12  4  5   0.010383437  949945  U  N [dd] 0
> > > 8,64    4  12  0.010383737  949945  U  N [dd] 2
> > > 8,64    4  13  0.010384935  949945  D  WS  4096 + 1024 [dd]
> > > 8,64    4  14  0.010394358  949945  D  WS  5120 + 1024 [dd]
> > > 8,64    4  15  0.018110846       0  C  WS  4096 + 1024 [0]
> > > 253,12  4  6   0.018112921       0  C  WS  2048 + 1024 [0]
> > > 8,64    4  16  0.019763115       0  C  WS  5120 + 1024 [0]
> > > 253,12  4  7   0.019765293       0  C  WS  3072 + 1024 [0]
> > > 253,12  4  8   0.020006270  949945  Q  WS  4096 + 1024 [dd]
> > > 8,64    4  17  0.020007928  949945  A  WS  6144 + 1024 <- (253,12) 4096
> > > 8,64    4  18  0.020008906  949945  Q  WS  6144 + 1024 [dd]
> > > 8,64    4  19  0.020010114  949945  G  WS  6144 + 1024 [dd]
> > > 8,64    4  20  0.020011824  949945  P  N [dd]
> > > 8,64    4  21  0.020012472  949945  I  WS  6144 + 1024 [dd]
> > > 253,12  4  9   0.020042173  949945  Q  WS  5120 + 1024 [dd]
> > >
> > > dd and blktrace with resource disconnected:
> > >
> > > 253,12  6  1   0.000000000  935502  Q  WS  0 + 1024 [dd]
> > > 8,64    6  1   0.000005862  935502  A  WS  2048 + 1024 <- (253,12) 0
> > > 8,64    6  2   0.000008418  935502  Q  WS  2048 + 1024 [dd]
> > > 8,64    6  3   0.000012204  935502  G  WS  2048 + 1024 [dd]
> > > 8,64    6  4   0.000015305  935502  P  N [dd]
> > > 8,64    6  5   0.000017047  935502  I  WS  2048 + 1024 [dd]
> > > 253,12  6  2   0.000049506  935502  Q  WS  1024 + 1024 [dd]
> > > 8,64    6  6   0.000052540  935502  A  WS  3072 + 1024 <- (253,12) 1024
> > > 8,64    6  7   0.000053122  935502  Q  WS  3072 + 1024 [dd]
> > > 8,64    6  8   0.000055384  935502  G  WS  3072 + 1024 [dd]
> > > 8,64    6  9   0.000056396  935502  I  WS  3072 + 1024 [dd]
> > > 253,12  6  3   0.000057981  935502  U  N [dd] 0
> > > 8,64    6  10  0.000058619  935502  U  N [dd] 2
> > > 8,64    6  11  0.000061829  935502  D  WS  2048 + 1024 [dd]
> > > 8,64    6  12  0.000082332  935502  D  WS  3072 + 1024 [dd]
> > > 8,64    4  1   0.008042208  902668  C  WS  2048 + 1024 [0]
> > > 253,12  4  1   0.008045714  902668  C  WS  0 + 1024 [0]
> > > 8,64    4  2   0.009682479       0  C  WS  3072 + 1024 [0]
> > > 253,12  4  2   0.009684697       0  C  WS  1024 + 1024 [0]
> > > 253,12  4  3   0.009935296  935502  Q  WS  2048 + 1024 [dd]
> > > 8,64    4  3   0.009937327  935502  A  WS  4096 + 1024 <- (253,12) 2048
> > > 8,64    4  4   0.009938885  935502  Q  WS  4096 + 1024 [dd]
> > > 8,64    4  5   0.009940089  935502  G  WS  4096 + 1024 [dd]
> > > 8,64    4  6   0.009941995  935502  P  N [dd]
> > > 8,64    4  7   0.009942455  935502  I  WS  4096 + 1024 [dd]
> > > 253,12  4  4   0.009972506  935502  Q  WS  3072 + 1024 [dd]
> > > 8,64    4  8   0.009973408  935502  A  WS  5120 + 1024 <- (253,12) 3072
> > > 8,64    4  9   0.009974076  935502  Q  WS  5120 + 1024 [dd]
> > > 8,64    4  10  0.009975116  935502  G  WS  5120 + 1024 [dd]
> > > 8,64    4  11  0.009976072  935502  I  WS  5120 + 1024 [dd]
> > > 253,12  4  5   0.009976980  935502  U  N [dd] 0
> > > 8,64    4  12  0.009977378  935502  U  N [dd] 2
> > > 8,64    4  13  0.009978559  935502  D  WS  4096 + 1024 [dd]
> > > 8,64    4  14  0.009986557  935502  D  WS  5120 + 1024 [dd]
> > > 8,64    4  15  0.017681675       0  C  WS  4096 + 1024 [0]
> > > 253,12  4  6   0.017683305       0  C  WS  2048 + 1024 [0]
> > > 8,64    4  16  0.019457154       0  C  WS  5120 + 1024 [0]
> > > 253,12  4  7   0.019458808       0  C  WS  3072 + 1024 [0]
> > > 253,12  4  8   0.019700743  935502  Q  WS  4096 + 1024 [dd]
> > > 8,64    4  17  0.019703005  935502  A  WS  6144 + 1024 <- (253,12) 4096
> > > 8,64    4  18  0.019704175  935502  Q  WS  6144 + 1024 [dd]
> > > 8,64    4  19  0.019705315  935502  G  WS  6144 + 1024 [dd]
> > > 8,64    4  20  0.019706887  935502  P  N [dd]
> > > 8,64    4  21  0.019707887  935502  I  WS  6144 + 1024 [dd]
> > > 253,12  4  9   0.019738113  935502  Q  WS  5120 + 1024 [dd]
> > >
> > > Running dd from the iscsi peer shows it is using a 512 KiB block size
> > > across the link to the iscsi target.
> > >
> > > Please let me know if you require any further information.
> > >
> > > Importantly, I believe I have discovered the cause of the high IO wait
> > > I was seeing during the sustained writing of a larger (40GB) file from
> > > the iscsi peer to the iscsi target (drbd primary node).
> > >
> > > During searching on Google for latency in kernel 3.2 I came across a
> > > number of postings where the intel_idle cpuidle driver was mentioned
> > > as causing some latency as a result of putting the processor(s) into
> > > sleep states > C1. When I checked using the i7z tool from Google Code
> > > (http://code.google.com/p/i7z/) the processors were spending a
> > > significant % of time in the C3 and C6 states. This was even though I
> > > had disabled Intel SpeedStep and other power saving features in the
> > > BIOS. As per
> > > http://stackoverflow.com/questions/12111954/context-switches-much-slower-in-new-linux-kernels
> > > I added the recommended kernel options to disable the intel_idle
> > > driver as follows:
> > >
> > > intel_idle.max_cstate=0 processor.max_cstate=1 idle=mwait
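On a GRUB2-based Debian/Proxmox-style system those options end up on the kernel command line roughly as follows, and the effect can be checked from sysfs after the reboot (the file location and the existing "quiet" option are just the usual defaults, adjust as needed):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0 processor.max_cstate=1 idle=mwait"

  # regenerate the grub config, then reboot
  update-grub

  # after the reboot, confirm which cpuidle driver and C-states are in use
  cat /sys/devices/system/cpu/cpuidle/current_driver
  cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name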
> > >
> > > After rebooting and performing the restore from the iscsi peer to the
> > > iscsi target (drbd primary), IO wait was <2% compared with >99% across
> > > all processors prior to the change.
> > >
> > > Thanks again
> > >
> > > Paul
> > > _______________________________________________
> > > drbd-user mailing list
> > > drbd-user at lists.linbit.com
> > > http://lists.linbit.com/mailman/listinfo/drbd-user
> >
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
>
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---