<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:Courier New, courier, monaco, monospace, sans-serif;font-size:10pt"><DIV>Greetings,</DIV>

<DIV>&nbsp;</DIV>

<DIV>We are experiencing issues with our IBM Server x3655/DRBD-NFS cluster<BR>setup in our lab. (This IBM server uses an onboard Broadcom NetXtreme II<BR>gigabit ethernet chipset.) </DIV>

<DIV>&nbsp;</DIV>

<DIV>Our setup has never worked entirely correctly.&nbsp; It is <BR>extremely slow, and we have noticed other symptoms that <BR>lead me to believe that we are seeing the same TOE (TCP offload engine)<BR>issue that other people have reported.</DIV>

<DIV>&nbsp;</DIV>

<DIV>So far I have been unable to locate any jumper or switch that I can set<BR>on the IBM server to turn TOE off.</DIV>

<DIV>&nbsp;</DIV>

<DIV>Symptoms:</DIV>

<DIV>1. Generally SLOW performance when all nodes are in the cluster, but if the</DIV>

<DIV>&nbsp;&nbsp; backup node is down, performance is acceptable.<BR>&nbsp;&nbsp;&nbsp;<BR>&nbsp;&nbsp; I ran a set of copy and remove operations with both DRBD nodes <BR>&nbsp;&nbsp; (active/backup) enabled, and again with the backup node shut down.<BR>&nbsp;&nbsp; The directory structure I was copying is 122 MB in size.<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; COPY&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DELETE<BR>&nbsp;&nbsp;&nbsp; Both nodes active&nbsp;&nbsp;&nbsp; ~20 seconds&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ~6 seconds<BR>&nbsp;&nbsp;&nbsp; One active node&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ~4 seconds&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;1 second</DIV>

<DIV>&nbsp;</DIV>

<DIV>&nbsp;&nbsp;&nbsp; While the copy or delete is running, the I/O wait on one of<BR>&nbsp;&nbsp;&nbsp; the server's CPUs spikes to 100% if both DRBD <BR>&nbsp;&nbsp;&nbsp; nodes are active.</DIV>

<DIV>&nbsp;</DIV>

<DIV>2.&nbsp; SSH issues<BR>&nbsp;&nbsp; Servers that mount directories from our DRBD/NFS server occasionally<BR>&nbsp;&nbsp; seem to pause for 5 to 15 seconds, then continue to work.<BR>&nbsp;&nbsp; I noticed that this was mentioned in one of the TOE threads as well.</DIV>
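
<DIV>&nbsp;</DIV>

<DIV>To put the copy timings from symptom 1 in perspective, here is a rough<BR>sketch (not a benchmark) that converts them to throughput. It assumes the<BR>122 MB size and the approximate copy times from the table; the helper<BR>name "throughput" is made up for illustration.</DIV>

<PRE>
```shell
#!/bin/sh
# Rough throughput implied by the copy timings in symptom 1.
# Assumptions: the tree is 122 MB and the timings are the approximate
# values from the table. "throughput" is a hypothetical helper name.
throughput() {
    # $1 = size in MB, $2 = elapsed seconds; prints MB/s with one decimal
    awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", size / secs }'
}

echo "both nodes active: $(throughput 122 20) MB/s"
echo "one active node:   $(throughput 122 4) MB/s"
```
</PRE>

<DIV>That works out to roughly 6 MB/s with both nodes active versus about<BR>30 MB/s with only one node.</DIV>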

<DIV>&nbsp;</DIV>

<DIV>I have not found a way to disable TOE via hardware on these servers.<BR>Disabling via software (with the commands below) has not helped.<BR>&nbsp;&nbsp;&nbsp;&nbsp; ethtool -K eth0 rx off<BR>&nbsp;&nbsp;&nbsp;&nbsp; ethtool -K eth0 tx off<BR>&nbsp;&nbsp;&nbsp;&nbsp; ethtool -K eth0 sg off<BR>&nbsp;&nbsp;&nbsp;&nbsp; ethtool -K eth1 rx off<BR>&nbsp;&nbsp;&nbsp;&nbsp; ethtool -K eth1 tx off<BR>&nbsp;&nbsp;&nbsp;&nbsp; ethtool -K eth1 sg off</DIV>
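
<DIV>&nbsp;</DIV>

<DIV>The commands above only turn off RX/TX checksum offload and scatter-gather.<BR>As a sketch of what could be tried next, the following prints (dry run,<BR>nothing is changed) the same commands plus one for TCP segmentation offload,<BR>followed by a query of the resulting settings. Adding "tso" is an assumption<BR>on my part, not something confirmed to be related to this problem, and<BR>"offload_off_cmds" is a made-up helper name.</DIV>

<PRE>
```shell
#!/bin/sh
# Dry run: print the ethtool commands that would disable offload features
# on each interface given as an argument. Nothing is executed here.
# Assumptions: interfaces are eth0/eth1 as in the commands above, and "tso"
# (TCP segmentation offload) is an extra feature added for illustration.
# "offload_off_cmds" is a hypothetical helper name.
offload_off_cmds() {
    for nic in "$@"; do
        for feature in rx tx sg tso; do
            echo "ethtool -K $nic $feature off"
        done
        # Lowercase -k lists the current offload settings for verification.
        echo "ethtool -k $nic"
    done
}

offload_off_cmds eth0 eth1
```
</PRE>

<DIV>Running "offload_off_cmds eth0 eth1 | sh" as root would apply the settings<BR>instead of just printing them.</DIV>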

<DIV>&nbsp;</DIV>

<DIV>Does anyone have any experience running DRBD on these servers?</DIV>

<DIV>Any other suggestions on what to try?</DIV>

<DIV>&nbsp;</DIV>

<DIV>Configuration:<BR>&nbsp; Two IBM Server x3655 servers<BR>&nbsp; DRBD version 8.2.5<BR>&nbsp; Red Hat Enterprise Linux Server release 5.1 (Tikanga) (64 bit)<BR>&nbsp; pacemaker-0.6.5-2.2<BR>&nbsp; heartbeat-2.1.3-23.1<BR>&nbsp; <BR>&nbsp; DRBD currently shares a NIC with the regular network traffic, but we are<BR>&nbsp; going to move it to the second NIC later.<BR>&nbsp; We have tried syncer rates of 10M, 40M and 400M (on an isolated<BR>&nbsp; network), but still see the same issues.</DIV>

<DIV>------------------------------------------------------------------<BR>&gt;&gt;&gt;cat /proc/drbd<BR>version: 8.2.5 (api:88/proto:86-88)<BR>GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by bachbuilder@, 2008-03-23 14:10:04<BR>&nbsp;0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---<BR>&nbsp;&nbsp;&nbsp; ns:0 nr:252 dw:252 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; resync: used:0/31 hits:20 misses:6 starving:0 dirty:0 changed:6<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0</DIV>

<DIV><BR>------------------------------------------------------------------<BR>&gt;&gt;&gt;cat /etc/drbd.conf<BR># drbd.conf<BR>resource drbd0 {<BR>&nbsp;protocol C;<BR>&nbsp;handlers {<BR>&nbsp;&nbsp;&nbsp; pri-on-incon-degr "echo o &gt; /proc/sysrq-trigger ; halt -f";<BR>&nbsp;&nbsp;&nbsp; pri-lost-after-sb "echo o &gt; /proc/sysrq-trigger ; halt -f";<BR>&nbsp;&nbsp;&nbsp; local-io-error "echo o &gt; /proc/sysrq-trigger ; halt -f";<BR>&nbsp;&nbsp;&nbsp; outdate-peer "/usr/sbin/drbd-peer-outdater";<BR>&nbsp;}<BR>&nbsp;startup {<BR>&nbsp;&nbsp;&nbsp; degr-wfc-timeout 120;&nbsp;&nbsp;&nbsp; # 2 minutes.<BR>&nbsp; }</DIV>

<DIV>&nbsp; disk {<BR>&nbsp;&nbsp;&nbsp; on-io-error&nbsp;&nbsp; detach;<BR>&nbsp; }</DIV>

<DIV>&nbsp; syncer {<BR>&nbsp;&nbsp;&nbsp; rate 40M;<BR>&nbsp;&nbsp;&nbsp; al-extents 257;<BR>&nbsp; }</DIV>

<DIV>&nbsp; on int-dbs-01 {<BR>&nbsp;&nbsp;&nbsp; device&nbsp;&nbsp;&nbsp;&nbsp; /dev/drbd0;<BR>&nbsp;&nbsp;&nbsp; disk&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /dev/sdd2;<BR>&nbsp;&nbsp;&nbsp; address&nbsp;&nbsp;&nbsp; 172.24.2.211:7799;<BR>&nbsp;&nbsp;&nbsp; meta-disk&nbsp; /dev/sdd1[0];<BR>&nbsp; }</DIV>

<DIV>&nbsp; on int-dbs-02 {<BR>&nbsp;&nbsp;&nbsp; device&nbsp;&nbsp;&nbsp;&nbsp; /dev/drbd0;<BR>&nbsp;&nbsp;&nbsp; disk&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /dev/sdd2;<BR>&nbsp;&nbsp;&nbsp; address&nbsp;&nbsp;&nbsp; 172.24.2.212:7799;<BR>&nbsp;&nbsp;&nbsp; meta-disk&nbsp; /dev/sdd1[0];<BR>&nbsp; }<BR>}<BR></DIV></div><br>


      </body></html>