Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello again!

I did some further testing and performance tuning. I was able to raise the average throughput a bit with large max-buffers and by changing the I/O scheduler, but the DRBD throughput is still way behind the maximum throughput of the raw devices, while I/O wait is still going through the roof.

I think this is related to the 4k writes on the secondary. While DRBD accumulates writes on the primary into larger block I/O to the raw disk, it does not on the secondary. All I/O to the raw disk on the secondary is 8 sectors or 4 kbytes large. That is hitting the storage on the secondary with up to 100000 4k I/Os per second, and it seems the raid card limit is kicking in somewhere around this. Additional testing with iostat indicates that the disk utilisation on the secondary goes up to 99% when doing large bulk writes on the primary, whereas on the primary the disk utilisation is only about 60% with the same amount of data being written at the same time.

When DRBD does a sync or resync it transmits larger blocks to the secondary and I am getting up to 600 Mbyte/s write speed on the secondary (which sounds quite okay for a backing device that is able to do somewhere around 700 Mbyte/s). Swapping the roles for a test device did not change anything - the situation is just the other way round: all I/O on the secondary is 4 kbytes large during ongoing replication and considerably larger during sync.

This is how it ALWAYS looks on the secondary during ongoing replication (notice the avgrq-sz of 8 sectors):

Device:  rrqm/s  wrqm/s   r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb        0.00    0.00  0.00  18.50   0.00   0.07      8.00      0.00   0.11     0.00     0.11   0.11   0.20

Interestingly the I/O wait on the secondary does not rise at all (not over 1% in any situation), but on the primary it rises a lot (as I said, up to 60% of 8 cores when doing large sequential operations) - even though the disks are nearly fully utilized during large sequential writes (it seems DRBD has some kind of throttling mechanism here?).

So I built a test setup using two virtual machines and built DRBD 8.4.3 from source for it, also running on kernel 3.2.35. Interestingly, in my test setup the situation is different: here the block I/O on the secondary raw device has varying sizes, not only 8-sector (4k) requests. Of course I can't do any serious performance benchmark on two virtual machines.

Can anyone confirm this behaviour? Can anyone elaborate why

1. DRBD splits up all I/O into 4k I/O (possibly related to the intention to fit each I/O into a single jumbo frame?)
2. in DRBD 8.3.11 (at least on my hardware) all block I/O hits the secondary backend device with a 4k block size?
3. the DRBD 8.4.3 behaviour seems to be different (is it actually different, or is this just down to the different (virtual) hardware)?

As far as I found out by googling, DRBD should normally determine the maximum block I/O size from the underlying hardware via

/sys/block/<device>/queue/max_hw_sectors_kb

But my hardware actually shows a value of 512 there, which indicates it will accept block I/O up to a size of 512 kbytes - if I am right. The DRBD devices show a value of 128 there, which seems to do actually nothing, as every write request to the DRBD device is issued in a 4k manner. (This is the same behaviour on my 8.4 test setup, but there the writes seem to be accumulated on the secondary just like they are on the primary.)
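For reference, this is roughly how I am reading those limits and watching the request sizes; the device names (sdb for the secondary's backing device, drbd3 for one of the DRBD devices) are the ones from my setup and the iostat interval is just an example:

# request size limits of the backing device behind the raid card
cat /sys/block/sdb/queue/max_hw_sectors_kb     # shows 512 here
cat /sys/block/sdb/queue/max_sectors_kb

# the same limits as exposed by one of the drbd devices
cat /sys/block/drbd3/queue/max_hw_sectors_kb   # shows 128 here
cat /sys/block/drbd3/queue/max_sectors_kb

# request sizes actually hitting the backing device
# (avgrq-sz is reported in 512-byte sectors, so 8 = 4 kbytes)
iostat -xm sdb 1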
Thank you all in advance,

regards, Felix

> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Felix Zachlod
> Sent: Friday, March 15, 2013 08:32
> To: drbd-user at lists.linbit.com
> Subject: [DRBD-user] High I/O Wait when cluster is connected (and reduced throughput)
>
> Hello list,
>
> We are currently investigating a performance issue with our DRBD cluster. We have recently upgraded the hardware in the cluster (added new and additional raid cards, SSD caching and a lot of spindles) and are experiencing the following problem:
>
> When our secondary node is connected to the cluster, this leads to a dramatic performance drop in terms of the time the primary spends in I/O wait and a dramatically reduced throughput to the backing devices.
>
> Before upgrading our cluster hardware we just thought that the high I/O wait was due to our backing devices hitting the ground, but according to my tests that can't be the reason.
>
> We currently have 4 raid sets on each cluster node as backing devices, connected to two LSI 9280 raid cards (in each cluster node) with BBWC and the FastPath and CacheCade options. CacheCade is turned on for these tests and the CacheCade disks are Intel 520 SSDs running in raid 1. Each controller has an SSD cache like this.
>
> Even if there is absolutely no load on the cluster we observe the following:
>
> When doing a single sequential write on the primary node to one of the raid sets, the I/O wait of the cluster node rises to about 40-50% of the CPU time while the secondary is sitting IDLE (I/O wait 0.2 percent there). Notice these are 8-core machines, so 50% I/O wait means 4 cores or 4 threads waiting for the block device all the time. Throughput drops from ~450 Mbyte/s to 200 Mbyte/s compared to the situation where we take down the corresponding drbd device on the secondary.
>
> If the drbd device is running in stand-alone mode on the primary, the I/O wait is as low as ~10-15 percent, which I assume is normal behaviour when a single sequential write is hitting a block device at max rate. We first thought that this might be an issue with our scst configuration, but it also happens if we do a test locally on the cluster node.
>
> The cluster nodes are connected with a 10GBE link in a back-to-back fashion. The measured RTT is about 100us and the measured TCP bandwidth is 9.6 Gbps.
>
> As I said, the backing device on the secondary is just sitting there bored - the I/O wait on the secondary is about 0.2 percent. We already raised the al-extents parameter, as I read that frequent meta data updates could be a cause for performance issues.
>
> This is the current config of one of our drbd devices:
>
> resource XXX {
>
>     device    /dev/drbd3;
>     disk      /dev/disk/by-id/scsi-3600605b00494046018bf37719f7e1786-part1;
>     meta-disk internal;
>
>     net {
>         max-buffers      16384;
>         unplug-watermark 32;
>         max-epoch-size   16384;
>         sndbuf-size      1024k;
>     }
>
>     syncer {
>         rate       200M;
>         al-extents 3833;
>     }
>
>     disk {
>         no-disk-barrier;
>         no-disk-flushes;
>     }
>
>     on node-a {
>         address 10.1.200.1:7792;
>     }
>
>     on node-b {
>         address 10.1.200.2:7792;
>     }
> }
>
> Here is a screenshot showing a single sequential write to the device (the primary is on the left): http://imageshack.us/f/823/drbd.png/. By the way, can anyone possibly elaborate why the tps is so much higher on the secondary?
>
> There are 16GB of memory in each node, and the device we are talking about is a 12-spindle raid 6 with ssd caching (read and write). We already tried disabling the ssd cache but that makes things even worse.
> Although the cluster is STILL responsive most of the time, this is becoming an issue for us, as it seems to impact performance on all devices on the cluster: the I/O service time of ALL devices rises to about 300ms when e.g. a storage vmotion is done on one of the devices, so putting more disks into the cluster does not help us reasonably improve performance at the moment.
>
> BTW this happens on ALL drbd devices in this cluster, with different raid sets, different disks and so on.
>
> Thanks to everyone in advance,
>
> regards, Felix
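For anyone who wants to reproduce the baseline numbers from the message quoted above (sequential write throughput, RTT and TCP bandwidth): the invocations below are only reconstructed examples of the kind of measurements meant there, not the exact commands used; /dev/drbdX stands in for a scratch/test drbd device, and an iperf server is assumed to be running on node-b.

# sequential write throughput to a test drbd device, bypassing the page cache
# (/dev/drbdX is a placeholder - this overwrites the device, use a scratch device only)
dd if=/dev/zero of=/dev/drbdX bs=1M count=16384 oflag=direct

# round trip time on the back-to-back replication link (node-b as seen from node-a)
ping -c 100 10.1.200.2

# raw TCP bandwidth over the 10GBE link ("iperf -s" running on node-b)
iperf -c 10.1.200.2 -t 30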