[DRBD-user] Howto find latency bottleneck

Robert Verspuy robert at exa-omicron.nl
Wed Sep 1 18:29:28 CEST 2010

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


I'm currently setting up a new PostgreSQL database cluster.
I have two Supermicro servers,
both with eight 500 GB Seagate SATA disks.

The Seagate disks perform at a maximum of 140 MB/s according to hdparm -t.
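(That number comes from a plain hdparm run; the device name below is
just an example:)

  hdparm -t /dev/sda    # buffered sequential reads from one member disk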

The 8 disks are combined into a RAID-10 with layout f2,
giving a sequential read performance of up to 900 MB/s and a write
performance of 330 MB/s.
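For completeness, a sketch of how such an array is typically created
with mdadm; the md device, member names and chunk size below are
placeholders, not my actual configuration:

  mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=<chunk> \
        --raid-devices=8 /dev/sd[a-h]
  cat /proc/mdstat    # verify layout and sync status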

The servers have four gigabit Ethernet cross cables bonded, and the
bond achieves more than 4 Gbit/s according to iperf.
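(Measured roughly like this; the peer address and the number of
parallel streams are just what I would expect for a 4-link bond:)

  iperf -s                   # on one node
  iperf -c <peer-ip> -P 4    # on the other node, 4 parallel TCP streams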

I ran the throughput and latency tests described in the DRBD
documentation: http://www.drbd.org/users-guide-emb/ch-benchmark.html
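The tests were run roughly as in that chapter, only with the counts and
block sizes changed as noted below; $DEVICE stands for whichever device
is under test (the drbd device, the ramdisk, or the RAID partition):

  # throughput: 10 x 512M per run, repeated 5 times
  dd if=/dev/zero of=$DEVICE bs=512M count=10 oflag=direct
  # latency: 1000 sequential 4k writes
  dd if=/dev/zero of=$DEVICE bs=4k count=1000 oflag=direct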

When testing between two 10 GB ramdisks (one on each server):

*Throughput: (changed to 10 times 512M)*

DRBD Connected:
5368709120 bytes (5.4 GB) copied, 13.5616 seconds, 396 MB/s
5368709120 bytes (5.4 GB) copied, 12.7732 seconds, 420 MB/s
5368709120 bytes (5.4 GB) copied, 13.1005 seconds, 410 MB/s
5368709120 bytes (5.4 GB) copied, 12.3543 seconds, 435 MB/s
5368709120 bytes (5.4 GB) copied, 12.778 seconds, 420 MB/s
This is very good for 4 x 1 gigabit (a 4 Gbit/s bond tops out at
roughly 500 MB/s, so this is close to wire speed).

DRBD Disconnected:
5368709120 bytes (5.4 GB) copied, 1.71126 seconds, 3.1 GB/s
5368709120 bytes (5.4 GB) copied, 1.50127 seconds, 3.6 GB/s
5368709120 bytes (5.4 GB) copied, 1.70883 seconds, 3.1 GB/s
5368709120 bytes (5.4 GB) copied, 1.5004 seconds, 3.6 GB/s
5368709120 bytes (5.4 GB) copied, 1.50411 seconds, 3.6 GB/s

*Latency: (changed to write 1000 times 4k)*

DRBD Connected:
4096000 bytes (4.1 MB) copied, 0.380132 seconds, 10.8 MB/s

DRBD Disconnected:
4096000 bytes (4.1 MB) copied, 0.00427 seconds, 959 MB/s

This also looks very good.
The roughly 0.4 ms of latency that DRBD shows when using ramdisks
(0.380132 s / 1000 writes = 0.38 ms per write) comes from the gigabit
network latency (ping times average 0.2 ms).

But when I run the same tests against a 1.4 TB RAID-10 partition:

*Throughput: (changed to 10 times 512M)*

DRBD Connected:
5368709120 bytes (5.4 GB) copied, 29.7357 seconds, 181 MB/s
5368709120 bytes (5.4 GB) copied, 20.0908 seconds, 267 MB/s
5368709120 bytes (5.4 GB) copied, 20.1032 seconds, 267 MB/s
5368709120 bytes (5.4 GB) copied, 20.8837 seconds, 257 MB/s
5368709120 bytes (5.4 GB) copied, 20.3392 seconds, 264 MB/s

DRBD Disconnected:
5368709120 bytes (5.4 GB) copied, 14.3373 seconds, 374 MB/s
5368709120 bytes (5.4 GB) copied, 14.8781 seconds, 361 MB/s
5368709120 bytes (5.4 GB) copied, 14.6109 seconds, 367 MB/s
5368709120 bytes (5.4 GB) copied, 14.3471 seconds, 374 MB/s
5368709120 bytes (5.4 GB) copied, 14.7132 seconds, 365 MB/s

OK, there's around 30% performance loss, but for me that is still
acceptable.

*Latency: (changed to write 1000 times 4k)*

DRBD Connected:
4096000 bytes (4.1 MB) copied, 24.7744 seconds, 165 kB/s

DRBD Disconnected:
4096000 bytes (4.1 MB) copied, 0.198809 seconds, 20.6 MB/s

The local disk shows 0.2 ms latency (0.198809 s / 1000 writes), which
is very acceptable, but with DRBD connected it shows almost 25 ms per
write (24.7744 s / 1000 writes).

I would expect here (0.2 ms from the disconnected DRBD test + 0.4 ms
network latency + some extra overhead =) around 1 ms of latency.

So my conclusion:
There is no network bottleneck,
There is no DRBD bottleneck,
There is no disk bottleneck,
But when combining DRBD with the disks, there is some kind of bottleneck.

So does anybody know where to start looking to find my bottleneck?

Thnx in advance,

Grz,
Robert Verspuy

-- 
*Exa-Omicron*
Patroonsweg 10
3892 DB Zeewolde
Tel.: 088-OMICRON (66 427 66)
http://www.exa-omicron.nl


