[DRBD-user] TOE issue with IBM x3655 servers?

Michael Toler mikeatprodea at yahoo.com
Tue Aug 26 16:48:18 CEST 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Greetings,
We are experiencing issues with our IBM Server x3655/DRBD-NFS cluster
setup in our lab. (This IBM server uses an onboard Broadcom NetXtreme II
gigabit ethernet chipset.) 
Our setup has never worked completely correctly.  It seems 
to be extremely slow, and we have noticed some other symptoms that 
lead me to believe that we are seeing the same TOE issue that other
people have reported.
So far I have been unable to locate any jumper or switch that I can set
on the IBM server to turn this off.
Symptoms:
1. Generally SLOW performance when all nodes are in the cluster, but if the
   backup node is down, performance is acceptable.
   
   I did a set of copy and remove operations with both DRBD nodes
   (active/backup) enabled, and again with the backup node shut down.
   The directory structure I was copying is 122 MB in size (a rough
   command sketch follows the symptom list):
                          COPY             DELETE
    Both nodes active     ~20 seconds      ~6 seconds
    One active node       ~4 seconds       >1 second
    When the copy or delete is taking place, we see the I/O wait
    on one of the CPUs on the server spike to 100% if both DRBD
    nodes are active.
2.  SSH issues
   Servers that mount directories from our DRBD/NFS server will, on
   occasion, seem to pause for 5 to 15 seconds, then continue to work.
   I noticed that this was mentioned in one of the TOE threads as well.
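(The timings in symptom 1 were taken with commands roughly of the form
below; /mnt/nfs and testdir are placeholders for the NFS-mounted export
and the 122 MB directory tree.)
     cd /mnt/nfs                      # placeholder path for the NFS mount
     time cp -a /path/to/testdir .    # copy the ~122 MB tree onto the DRBD-backed export
     time rm -rf ./testdir            # then time its removal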
I have not found a way to disable TOE via hardware on these servers.
Disabling via software (with the commands below) has not helped.
     ethtool -K eth0 rx off
     ethtool -K eth0 tx off
     ethtool -K eth0 sg off
     ethtool -K eth1 rx off
     ethtool -K eth1 tx off
     ethtool -K eth1 sg off
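For reference, the remaining offload settings can be inspected with
ethtool -k, and TCP segmentation offload can be switched off the same
way; the lines below are only a sketch (eth0/eth1 assumed to be the
onboard Broadcom ports, and the tso/gso options may not be reported by
every driver version):
     # show which offloads are currently enabled on each port
     ethtool -k eth0
     ethtool -k eth1
     # additionally disable TCP segmentation offload (and GSO, if listed)
     ethtool -K eth0 tso off
     ethtool -K eth1 tso off
     ethtool -K eth0 gso off
     ethtool -K eth1 gso off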
Does anyone have any experience running DRBD on these servers?
Any other suggestions on what to try?
Configuration:
  Two IBM Server x3655 servers
  DRBD version 8.2.5
  Red Hat Enterprise Linux Server release 5.1 (Tikanga) (64 bit)
  pacemaker-0.6.5-2.2
  heartbeat-2.1.3-23.1
  
  DRBD replication currently shares the same NIC as the general
  network, but we are going to move it to the second NIC later.
  We have tried syncer rates of 10M, 40M and 400M (on an isolated
  network), but still see the same issues.
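  Moving the replication link to the second NIC should only require
  changing the address lines in drbd.conf; a sketch, assuming the
  second NIC gets a dedicated subnet (the 192.168.100.x addresses are
  placeholders, everything else is taken from the config below):
    on int-dbs-01 {
      device     /dev/drbd0;
      disk       /dev/sdd2;
      address    192.168.100.1:7799;   # placeholder IP on the dedicated link
      meta-disk  /dev/sdd1[0];
    }
    on int-dbs-02 {
      device     /dev/drbd0;
      disk       /dev/sdd2;
      address    192.168.100.2:7799;   # placeholder IP on the dedicated link
      meta-disk  /dev/sdd1[0];
    }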
  
------------------------------------------------------------------
>>>cat /proc/drbd
version: 8.2.5 (api:88/proto:86-88)
GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by bachbuilder@, 2008-03-23 14:10:04
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:252 dw:252 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:20 misses:6 starving:0 dirty:0 changed:6
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0

------------------------------------------------------------------
>>>cat /etc/drbd.conf
# drbd.conf
resource drbd0 {
  protocol C;
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer      "/usr/sbin/drbd-peer-outdater";
  }
  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }
  disk {
    on-io-error detach;
  }
  syncer {
    rate 40M;
    al-extents 257;
  }
  on int-dbs-01 {
    device     /dev/drbd0;
    disk       /dev/sdd2;
    address    172.24.2.211:7799;
    meta-disk  /dev/sdd1[0];
  }
  on int-dbs-02 {
    device     /dev/drbd0;
    disk       /dev/sdd2;
    address    172.24.2.212:7799;
    meta-disk  /dev/sdd1[0];
  }
}



      