[DRBD-user] I/O Stalling on /dev/drbd0

Mark Watts m.watts at eris.qinetiq.com
Tue Sep 18 16:50:59 CEST 2007

Ok, I have several DRBD cluster pairs, all using Dell PowerEdge 1950 servers 
(twin dual-core Xeons with 2GB RAM, 2x 250GB SATA disks in RAID-1 on a PERC 5).

I'm running DRBD version 8.0.0 pre 4, as shipped with Mandriva 2007.1, using 
heartbeat to manage failover.

eth0 on each server is connected to a Cisco 2950 and acts as the main service 
eth1 on each server is connected via GigE crossover, for DRBD replication.
Heartbeat uses both interfaces for healthchecking.

I'm running PostgreSQL and another application from the DRBD partition, but 
neither is doing much yet since we're just testing things. (In fact, 
PostgreSQL is all but idle, and the other app is just polling a bunch of 
servers every few minutes.)

global {
    usage-count yes;
}

common {
  syncer { rate 100M; }
}

resource r0 {
  protocol C;
  handlers {
    pri-on-incon-degr "halt -f";
    pri-lost-after-sb "halt -f";
    outdate-peer "/usr/sbin/drbd-peer-outdater";
  }
  startup {
  }
  disk {
    on-io-error   detach;
  }
  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  on server1 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    meta-disk  internal;
  }

  on server2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    meta-disk internal;
  }
}

Quite frequently (say every 90 seconds or so), all I/O on the DRBD device seems 
to stall: interactive SSH sessions hang for ~15 seconds.
Generally speaking, vmstat shows 25% I/O wait, while top shows one of the 4 
CPUs at 100% I/O wait for extended periods. However, vmstat also reports 
that the actual amount of data transferred is negligible; mostly it's <100 
top does not show any process that is obviously causing this.
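For what it's worth, the 25% and 100% figures agree with each other: vmstat's wa column is derived from the aggregate "cpu" line in /proc/stat, which sums all four CPUs, so one CPU pinned in iowait averages out to 100 / 4 = 25%. Below is a minimal sketch of that arithmetic, assuming the Linux /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal, ...). The function names and the snapshot counters are made up for illustration; they are not output from these servers.

```python
# Minimal sketch (assumption: Linux /proc/stat layout, where the fields after
# each "cpu"/"cpuN" label are user, nice, system, idle, iowait, irq, softirq,
# steal, ... counted in jiffies).


def parse_proc_stat(text):
    """Map each 'cpu'/'cpuN' label to its list of integer counters."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(x) for x in fields[1:]]
    return counters


def iowait_percent(before, after):
    """Share of elapsed jiffies spent in iowait, per label, between two snapshots."""
    result = {}
    for label, old in before.items():
        new = after[label]
        # Use only the first 8 fields (user..steal); later guest fields are
        # already included in user/nice on kernels that report them.
        deltas = [n - o for n, o in zip(new[:8], old[:8])]
        total = sum(deltas)
        result[label] = 100.0 * deltas[4] / total if total else 0.0
    return result


# Hypothetical snapshots: over 100 jiffies, cpu0 sits entirely in iowait
# while cpu1..cpu3 are idle. The "cpu" line is the sum of the four CPUs.
BEFORE = """\
cpu  0 0 0 0 0 0 0 0
cpu0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 0 0
cpu2 0 0 0 0 0 0 0 0
cpu3 0 0 0 0 0 0 0 0
"""
AFTER = """\
cpu  0 0 0 300 100 0 0 0
cpu0 0 0 0 0 100 0 0 0
cpu1 0 0 0 100 0 0 0 0
cpu2 0 0 0 100 0 0 0 0
cpu3 0 0 0 100 0 0 0 0
"""

pct = iowait_percent(parse_proc_stat(BEFORE), parse_proc_stat(AFTER))
print(pct["cpu"], pct["cpu0"])  # 25.0 100.0
```

In other words, the aggregate 25% is not a separate symptom; it is the single stuck CPU seen through an average.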

Periodically, and not on the same frequency as the stalls, I see the following 
in my syslog:

Sep 16 14:49:22 server1 kernel: drbd0: ASSERT( b->n_req == set_size ) in 
Sep 16 14:49:22 server1 kernel: drbd0: b->n_req = 2 in 
Sep 16 14:49:22 server1 kernel: drbd0: set_size = 1 in 
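Since the ASSERTs don't appear to follow the ~90 second stall period, one way to check is to pull their timestamps out of syslog and compare the intervals. A rough sketch, assuming the classic syslog prefix shown above ("Mon DD HH:MM:SS host kernel: ..."); assert_times and gaps_seconds are hypothetical helper names, not part of DRBD or any existing tool.

```python
import re
from datetime import datetime

# Hypothetical helper (not part of DRBD): find the syslog timestamps of the
# "drbd0: ASSERT(" kernel messages so they can be compared with stall times.
# Assumes the classic syslog prefix "Mon DD HH:MM:SS host kernel: ...".
ASSERT_RE = re.compile(
    r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: drbd0: ASSERT\("
)


def assert_times(lines, year):
    """Return datetimes of drbd0 ASSERT lines; syslog omits the year."""
    times = []
    for line in lines:
        m = ASSERT_RE.match(line)
        if m:
            stamp = f"{year} {m.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_seconds(times):
    """Seconds between consecutive events, e.g. to compare with the ~90 s stalls."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# The three log lines quoted above; only the first one is the ASSERT itself.
SAMPLE = [
    "Sep 16 14:49:22 server1 kernel: drbd0: ASSERT( b->n_req == set_size ) in ",
    "Sep 16 14:49:22 server1 kernel: drbd0: b->n_req = 2 in ",
    "Sep 16 14:49:22 server1 kernel: drbd0: set_size = 1 in ",
]

print(assert_times(SAMPLE, 2007))
```

Fed a full day of /var/log/messages, the list from gaps_seconds would show directly whether the ASSERTs cluster around the stalls or run on their own schedule.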

Is there anything I can do to improve things? The raw horsepower these boxes 
have shouldn't be giving me anything like the I/O stalls I'm seeing.


Mark Watts BSc RHCE MBCS
Senior Systems Engineer
QinetiQ Trusted Information Management
Trusted Solutions and Services Group
GPG Key: http://keyserver.veridis.com:11371/search?q=0x455420ED