[DRBD-user] I/O Stalling on /dev/drbd0

Tue Sep 18 16:50:59 CEST 2007

OK, I have several DRBD cluster pairs, all using Dell PowerEdge 1950 servers 
(twin dual-core Xeons with 2GB RAM, 2x 250GB SATA disks on PERC 5 RAID-1).

I'm running DRBD version 8.0.0 pre 4, as shipped with Mandriva 2007.1, using 
heartbeat to manage failover.

eth0 on each server is connected to a Cisco 2950 and acts as the main service 
LAN.
eth1 on each server is connected via a GigE crossover cable, for DRBD replication.
Heartbeat uses both interfaces for health checking.

I'm running PostgreSQL and another application from the DRBD partition, but 
neither is doing much yet since we're just testing things. (In fact, 
PostgreSQL is all but idle, and the other app is just polling a bunch of 
servers every few minutes.)

/etc/drbd.conf:
##
global {
    usage-count yes;
}

common {
  syncer { rate 100M; }
}

resource r0 {
  protocol C;
  handlers {
    pri-on-incon-degr "halt -f";
    pri-lost-after-sb "halt -f";
    outdate-peer "/usr/sbin/drbd-peer-outdater";
  }
  startup {
  }
  disk {
    on-io-error   detach;
  }
  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  on server1 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    172.16.1.1:7788;
    meta-disk  internal;
  }

  on server2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   172.16.1.2:7788;
    meta-disk internal;
  }
}
##

Quite frequently (say every 90 seconds or so) all I/O on the DRBD device seems to 
stall: interactive SSH sessions will hang for ~15 seconds.
Generally speaking, vmstat shows 25% I/O wait, while top shows one of the 4 
CPUs at 100% I/O wait for extended periods. However, vmstat also 
reports that the actual amount of data transferred is negligible; mostly it's <100 
bytes/sec.
Top is not obviously showing any processes that may be causing this issue.
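The two iowait figures above are consistent with each other: one of four CPUs stuck at 100% iowait averages out to 25% system-wide. A minimal sketch of that arithmetic, assuming the Linux 2.6 `/proc/stat` field order (`user nice system idle iowait irq softirq steal ...`). The helper names and the snapshot strings are hypothetical, made up for illustration; they are not measurements from these servers.

```python
"""Hypothetical helper: per-CPU and aggregate iowait share between two
/proc/stat snapshots (assumes Linux 2.6+ field order)."""


def parse_cpu_lines(stat_text):
    """Return {cpu_name: [jiffy counters]} for the 'cpu' and 'cpuN' lines."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(v) for v in fields[1:]]
    return counters


def iowait_percent(before, after):
    """Percentage of elapsed jiffies spent in iowait, per CPU line."""
    old, new = parse_cpu_lines(before), parse_cpu_lines(after)
    result = {}
    for name in new:
        deltas = [n - o for n, o in zip(new[name], old[name])]
        # Sum only user..steal; later guest fields are already counted in user.
        total = sum(deltas[:8])
        # iowait is the 5th counter (index 4).
        result[name] = 100.0 * deltas[4] / total if total else 0.0
    return result


# Hypothetical snapshots: 100 jiffies elapse on each of 4 CPUs;
# cpu0 spends all of them in iowait, cpu1-cpu3 sit idle.
BEFORE = """cpu  0 0 0 0 0 0 0
cpu0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 0
cpu2 0 0 0 0 0 0 0
cpu3 0 0 0 0 0 0 0"""

AFTER = """cpu  0 0 0 300 100 0 0
cpu0 0 0 0 0 100 0 0
cpu1 0 0 0 100 0 0 0
cpu2 0 0 0 100 0 0 0
cpu3 0 0 0 100 0 0 0"""

for cpu, pct in sorted(iowait_percent(BEFORE, AFTER).items()):
    print(f"{cpu}: {pct:.1f}% iowait")
```

On a live box the same functions could be fed two reads of `/proc/stat` taken a few seconds apart, which would show whether the stalls always pin the same CPU.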

Periodically, though not at the same frequency as the stalls, I see the following 
in my syslog:

Sep 16 14:49:22 server1 kernel: drbd0: ASSERT( b->n_req == set_size ) in drivers/block/drbd/drbd_main.c:299
Sep 16 14:49:22 server1 kernel: drbd0: b->n_req = 2 in drivers/block/drbd/drbd_main.c:307
Sep 16 14:49:22 server1 kernel: drbd0: set_size = 1 in drivers/block/drbd/drbd_main.c:308

Is there anything I can do to improve things? With the raw horsepower these boxes 
have, I shouldn't be seeing anything like these I/O stalls.

Mark.

-- 
Mark Watts BSc RHCE MBCS
Senior Systems Engineer
QinetiQ Trusted Information Management
Trusted Solutions and Services Group
GPG Key: http://keyserver.veridis.com:11371/search?q=0x455420ED