[DRBD-user] DRBD doesn't seem to scale to fast underlying hardware with typical database load

Sat Aug 9 20:21:41 CEST 2008

Hi all,

Please forgive the subject if it seems accusatory; I don't mean it as
such.  I've actually been a happy DRBD user for a couple of years on
more basic hardware and have nothing but praise for how well it has
worked for me.  However, where I've used it before is in situations
where the database is mostly reads, and the schema is highly normalized
to minimize the I/O needed for each write.  The performance I'm seeing
with DRBD at this point seems to indicate that something is going very,
very wrong with DRBD itself, as opposed to being a fault of higher
write I/O requirements (I go into detail below, but for general
database transaction measurements, I see up to a 1000x difference
between local disk and DRBD performance).  Going forward, the database
traffic will be almost entirely writes, as we'll be utilizing memcached
(updated automatically via pg/memcached) for most reads.  So I guess a
pretty core question is whether DRBD is generally a good solution for a
write-heavy setup, given enough underlying I/O.  Currently, our entire
database fits into RAM, so read speed is really not much of a concern
at all; write speed (particularly concurrent writes) is.

Where I'm working now, we are currently write-bound on a server with a  
RAID1 of two 15,000rpm SAS disks.

We've just purchased two fully-loaded Sun Fire x4240's - the original  
intent was to use DRBD to synchronize data on a RAID10 consisting of  
ten 10,000rpm SAS disks.  I was under the impression that while DRBD  
would add some write overhead, the overall write throughput of this  
configuration would be significantly higher than our existing server,  
while adding HA-level reliability to our production database.

I'll spit out all the hardware/software specs that might be relevant -
please let me know if more information about any of them is required.
I am also happy to run (or re-run) more tests as requested:

Hardware (each of two servers):
* 10-disk RAID10 for PostgreSQL data (intended to be replicated with  
DRBD).
* 2-disk RAID1 for PostgreSQL write-ahead log (intended to be  
replicated with DRBD).
* 2-disk RAID1 for operating system.
* All disks are Hitachi 2.5" SAS 10,000rpm.
* 2 disks as global hot spares
* Rebranded Adaptec RAID card - "Sun STK RAID" - I'm *guessing* this  
is actually an Adaptec RAID 3805.
* 256MB battery-backed write cache enabled.
* Individual drive write cache disabled.
* 32GB DDR2 RAM.
* 4x NVidia NForce ethernet cards
* Crossover cabling between two of the ethernet ports

Software:
* Debian Linux - "etch" (latest stable)
* Kernel version 2.6.24 pulled from etch-backports (mainly just as a  
result of trying to see if a newer kernel/raid driver would help  
matters any).  Have not seen any significant performance difference  
when using the 2.6.18 kernel that etch stable provides.
* DRBD 8.0.12, also pulled from etch-backports.
* 802.3ad bonding used for two of the ethernet ports which have  
crossover cabling.  This is the DRBD network.
* MTU set to 9000 (jumbo frames) - another effort to improve  
performance that had no measurable impact whatsoever.
* PostgreSQL 8.1.13 (upgrading to later major releases is not  
currently an option but slated to happen late this year or early 2009).

Tests (that I remember, all DRBD results after initial sync completed,  
"local disk" means the same underlying RAID10 that DRBD would  
otherwise use):
* using dd with bs=1GB count=1 and dsync:
   * If I start DRBD on only one server after a fresh boot with no  
slave connected, attains ~75MB/s write.
   * With slave connected, or even after it has been disconnected
again following a connection, attains ~16MB/s write.
   * Do not remember result with local disk, but I think it was well  
over 100MB/s.
* PostgreSQL initdb:
   * Local disk:  near-instantaneous, under 1 second.
   * DRBD with protocol C:  17 seconds.
   * DRBD with protocol B:  17 seconds.
   * DRBD with protocol A:  6 seconds.
* PostgreSQL pgbench -i -s 100 (inserts a bunch of data):
   * Local disk: very fast. (sorry I don't have exact numbers handy  
for these, but I can get them if desired).
   * DRBD with slave disconnected:  Pretty fast, but not comparable to  
local disk.
   * DRBD with slave connected:  Quite slow, at least 10 times longer  
than local disk.
* PostgreSQL pgbench -t 1000:
   * DRBD using 1 client (single connection, so no database-level  
parallelization):  3 transactions per second (!)
   * DRBD using 2, 4, 8, 16, 32 clients - I see better results with  
parallelization, but it's still rather pitiful, and the total  
transactions per second is always less than number of connections * 3.
   * Local disk using 1 client:  ~1500 transactions per second
   * Local disk using 4+ clients:  >3000 transactions per second.
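
To put the pgbench figures above in perspective, here is a rough sketch
of the arithmetic.  It assumes a single client runs its transactions
strictly one after another, so the average time per transaction is
simply 1/tps.  The dictionary keys are labels invented for this
illustration, and 3000 is used as a lower bound for the ">3000" result.

```python
# Throughput figures (transactions per second) quoted in the test list above.
# The key names are made up for this sketch; 3000 stands in for ">3000".
TPS = {
    "drbd_1_client": 3,
    "local_1_client": 1500,
    "local_4plus_clients": 3000,
}


def ms_per_transaction(tps):
    """Average milliseconds per transaction for one client working serially."""
    return 1000.0 / tps


def slowdown(local_tps, drbd_tps):
    """How many times more transactions per second the local disk achieves."""
    return local_tps / drbd_tps


drbd_ms = ms_per_transaction(TPS["drbd_1_client"])
local_ms = ms_per_transaction(TPS["local_1_client"])
single = slowdown(TPS["local_1_client"], TPS["drbd_1_client"])
peak = slowdown(TPS["local_4plus_clients"], TPS["drbd_1_client"])

print(f"DRBD, 1 client:  {drbd_ms:.1f} ms per transaction")
print(f"Local, 1 client: {local_ms:.2f} ms per transaction")
print(f"Slowdown, 1 client vs 1 client: {single:.0f}x")
print(f"Slowdown, local 4+ clients vs DRBD 1 client: {peak:.0f}x")
```

At roughly a third of a second per transaction, the single-client DRBD
result looks more like a per-write latency problem than a bandwidth
limit, and the 1000x ratio is where the figure in the introduction
comes from.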

Things I've tried to get performance to a tolerable level:
* Settings in sysctl.conf for network/disk-level improvement:
kernel.sem=500 512000 64 2048
kernel.shmmax=8589934592
kernel.shmall=2097152
kernel.msgmni=2048
kernel.msgmax=65536
net.core.rmem_default=262144
net.core.rmem_max=8388608
net.core.wmem_default=262144
net.core.wmem_max=8388608
net.ipv4.tcp_rmem=4096 87380 8388608
net.ipv4.tcp_wmem=4096 87380 8388608
net.ipv4.tcp_mem=4096 4096 4096
vm.dirty_background_ratio=1
vm.overcommit_memory=2
vm.swappiness=0
* I/O scheduler set to deadline.
* JFS filesystem.
* DRBD tuning:
   * Tried setting al-extents anywhere between the lowest and highest  
possible settings with prime numbers.
   * Set max-buffers a lot higher.
   * Tried changing the unplug-watermark setting between the lowest
all the way up to match a very high max-buffers setting.
* wal_sync_method = fdatasync in postgresql.conf

None of the above has really made much measurable impact.  At best, I  
slice a fraction of a second off of the initdb test, and make  
similarly negligible impacts on the other tests.
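
Since the single-client pgbench result depends on how quickly each
commit's WAL write reaches stable storage, it may help to measure small
synchronous writes directly, without PostgreSQL in the way.  What
follows is a minimal sketch, assuming Python 3 on Linux.  The function
name, the probe file name, and the 8192-byte block size (chosen to
match PostgreSQL's default page size) are illustrative choices, not
anything taken from the setup above.  It writes a block, calls
fdatasync (the call selected by wal_sync_method = fdatasync), and
reports the average time per write.

```python
import os
import tempfile
import time


def sync_write_latency(directory, writes=100, block_size=8192):
    """Return the average seconds taken by one write followed by fdatasync."""
    path = os.path.join(directory, "sync_probe.tmp")  # hypothetical file name
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fdatasync(fd)  # block until the data is on stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)  # clean up the probe file
    return elapsed / writes


# Demonstration against a throwaway temporary directory.
with tempfile.TemporaryDirectory() as scratch:
    average = sync_write_latency(scratch, writes=20)
    print(f"average synced write: {average * 1000:.3f} ms")
```

Running the function once against a directory on the local RAID10 and
once against a directory on the DRBD device, with the slave connected
and disconnected, would give directly comparable per-write latencies.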

Some observations:
* iostat -x (from the sysstat package) on the slave shows that even
during the dd test, the write I/O of the underlying RAID device is
under 5% utilization, except for a spike to over 50% right at the end
of the dd run.
* I was using standard MTU for almost all of my testing.  I only tried  
jumbo frames recently, so I don't think that's the problem (I've read  
something about DRBD and jumbo frames not playing nice in some  
circumstances).
* This thread sounds somewhat similar to my situation (but never  
really reaches any resolution):  http://lists.linbit.com/pipermail/drbd-user/2008-April/009155.html
* Sync speed seems pretty good - /proc/drbd shows rates between 50MB/s  
and 100MB/s when it is catching up after disconnected operation or the  
initial sync.
* I've tried using a single plain ethernet device rather than a bond  
device, but it doesn't help, at least with sync speed.
* Parallel write operations hit pretty hard - for instance, if
pgbench -i is running (inserting a bunch of rows) and an autovacuum
process starts up, the inserts go so slowly that I think for a moment
it has hung entirely.
* We have a few weeks before these servers go live but realistically  
if I can't make DRBD work a lot better within the next week or so I'm  
going to have to give up on it for now...

My thoughts about what could be wrong:
* Maybe DRBD itself is simply limited (I don't really think so,  
certainly hope not)
* Maybe the network cards are crap, and we should purchase some Intel  
Pro PCIe gigabit adapters to use instead.
* Maybe the RAID card is crap, and we should purchase some LSI  
MegaRAID cards instead (it's going to be pretty hard to get any more  
hardware purchased this year, so I'm hoping it's not a hardware issue,  
and also using any custom hardware means that our Sun support contract  
becomes worth a lot less).
* Maybe JFS is a bad choice to use on top of DRBD (kind of doubt it)
and we should use another filesystem whose disk accesses match DRBD's
expectations better.

We're a small company without a lot of resources and we've spent the
available database budget on these servers, so I'm hoping it's
possible to get some good advice without a support contract.  However,
if we do manage to get DRBD working in a tolerable fashion, I'm going
to push (most likely successfully, since this hardware should be
adequate for our needs for a long time to come) for purchasing a
support contract next year, as I really don't like spending so much
time trying to get things tuned when I have other work to be doing.
If we can't get past the DRBD problems soon (or if in fact DRBD simply  
does not scale up to this amount of I/O very well), we'll be going  
forward with a database-level replication solution (not clear on what  
the best options are for us yet).  If there are any PostgreSQL folks on
the list, I'm open to discussion about that; however, please reply to
me privately in that case so as not to clutter up the DRBD list.
Similarly, if there are alternate lower-level replication solutions  
more adequate for high write I/O I welcome suggestions.

Many thanks in advance,
-- 
Casey Allen Shobe
Database Architect, The Berkeley Electronic Press
cshobe at bepress.com (email/jabber/aim/msn)
http://www.bepress.com | +1 (510) 665-1200 x163