<br><font size=2 face="sans-serif">I am experiencing similiar problems
though not so severe.</font>
<br><font size=2 face="sans-serif">A Num,ber of Systems running SLES 11
HA / SLES11 HA SP1 on both IBM and Dell Hardware, Interlinks Intel IGB,
Boradcom or E1000 all showing the same behaviour. From Time to time the
verification fails, the device disconnects, reconnects, is fine again.
we are not using Jumbo frames, most systems are directly connected, some
going over switches, bonding mode 4 or 0, dependinig on tha system. However,
we are able to do a full sync without problems, and there are no OOS-blocks
found when doing a verify. The only impact is on performance, if the problem
occurrs more frequently. Reconnect ans resync are done within 1-2 seconds.
Quality of the intterlink has a huge impact, however even wit new cables
and new NICS and short cables this shows up, just less frequently. I am
not sure if it is tied to bonding, because AFAIR i have also seen this
with a single link. DRBD-versions tested range from 8.3.4 to 8.3.8. Kernel
2.6.27 (Sles11) and 2.6.32 (Sles11SP1) at the newest available patchlevel,
DRBD also patched to the newest levels available from Novell for the respective
SLES releases. </font>
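<br><font size=2 face="sans-serif">Schematically, the verify setup on
these systems looks like this (a sketch only; the resource name r0 and
the algorithm are placeholders, not our exact configuration):</font>
<pre>
# /etc/drbd.conf (DRBD 8.3.x): the checksum for online verify is
# configured in the syncer section
syncer {
    verify-alg sha1;
}

# run an online verify and check the out-of-sync (oos) counter
drbdadm verify r0
grep oos: /proc/drbd
</pre>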
<br><font size=2 face="sans-serif">The ammount of trouble seems tied to
the IO-load on the system. Most notable not all devices are affected at
the same time, so there can be two DRBD devices under load, with onel having
issues all the time, while another running over the same path on the same
systems under simmilar load at the same time has none. disabling traffic
offloading does not always help, but gives better stability on some systems.
firmware- and driverupdates did not help so far.</font>
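<br><font size=2 face="sans-serif">For completeness, disabling the
offloads is done with ethtool along these lines (a sketch; eth0 stands
for the actual interlink interface, and not every offload exists on
every kernel/driver combination):</font>
<pre>
# turn off checksum, scatter-gather and segmentation offloads
ethtool -K eth0 rx off tx off sg off tso off
# GSO/GRO are only available on newer kernels (GRO needs >= 2.6.29)
ethtool -K eth0 gso off gro off
</pre>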
<br><font size=2 face="sans-serif"><br>
</font><font size=2 color=#5f5f5f face="sans-serif">Mit freundlichen Grüßen
/ Best Regards<b><br>
</b></font>
<br><font size=2 color=#5f5f5f face="sans-serif">Robert Köppl<br>
<br>
System Administration<br>
<b><br>
KNAPP Systemintegration GmbH</b><br>
Waltenbachstraße 9<br>
8700 Leoben, Austria <br>
Phone: +43 3842 805-910<br>
Fax: +43 3842 82930-500<br>
robert.koeppl@knapp.com <br>
www.KNAPP.com <br>
<br>
Commercial register number: FN 138870x<br>
Commercial register court: Leoben<br>
</font>
<br>
<br>
<br>
<table width=100%>
<tr valign=top>
<td width=40%><font size=1 face="sans-serif"><b>Steve Thompson <smt@vgersoft.com></b>
</font>
<br><font size=1 face="sans-serif">Gesendet von: drbd-user-bounces@lists.linbit.com</font>
<p><font size=1 face="sans-serif">13.01.2011 22:28</font>
<td width=59%>
<table width=100%>
<tr valign=top>
<td>
<div align=right><font size=1 face="sans-serif">An</font></div>
<td><font size=1 face="sans-serif">drbd-user@lists.linbit.com</font>
<tr valign=top>
<td>
<div align=right><font size=1 face="sans-serif">Kopie</font></div>
<td>
<tr valign=top>
<td>
<div align=right><font size=1 face="sans-serif">Thema</font></div>
<td><font size=1 face="sans-serif">[DRBD-user] Bonding [WAS repeated resync/fail/resync]</font></table>
<br>
<br></table>
<br>
<br>
<br><tt><font size=2>On Sat, 8 Jan 2011, Steve Thompson wrote:<br>
<br>
CentOS 5.5, x86_64, drbd 8.3.8, Dell PE2900 servers w/16GB memory. The
<br>
replication link is a dual GbE bonded pair (point-to-point, no switches)
<br>
in balance-rr mode with MTU=9000. Using tcp_reordering=127.<br>
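<br>
For concreteness, the link setup is roughly the following (a sketch;<br>
interface names and addresses are placeholders, not my exact ones):<br>
<pre>
# bond two GbE ports in balance-rr (mode 0) for the replication link
modprobe bonding mode=balance-rr miimon=100
ifconfig bond0 10.0.0.1 netmask 255.255.255.0 mtu 9000 up
ifenslave bond0 eth2 eth3

# balance-rr stripes packets across both ports, so let TCP tolerate
# more reordering before it treats it as loss
sysctl -w net.ipv4.tcp_reordering=127
</pre>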
<br>
I reported that a resync failed and restarted every minute or so for a
<br>
couple of weeks. I have found the cause, but am not sure of the solution.<br>
I'll try to keep it short.<br>
<br>
First, I swapped cables, cards, systems, etc. in order to be sure of the
<br>
integrity of the hardware. All hardware checks out OK.<br>
<br>
Secondly, I was using a data-integrity-alg of sha1 or crc32c (tried both).
<br>
Only when this was removed from the configuration was I able to get a full
<br>
resync to complete. There is an ext3 file system on the drbd volume, but
<br>
it is quiet; this is a non-production test system.<br>
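<br>
The offending setting, schematically (the resource name is a placeholder):<br>
<pre>
# /etc/drbd.conf: data-integrity-alg lives in the net section (DRBD 8.3);
# only after removing this line did a full resync complete
resource r0 {
    net {
        data-integrity-alg sha1;    # crc32c showed the same behaviour
    }
}
</pre>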
<br>
After this, a verify pass showed several out-of-sync blocks. I<br>
disconnected and reconnected, and re-ran the verify pass: now there were<br>
more out-of-sync blocks, but in a different place. Rinse and repeat;<br>
verify was never clean, and the out-of-sync blocks were never in the<br>
same place twice.<br>
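<br>
The cycle I kept repeating, schematically (r0 is a placeholder):<br>
<pre>
drbdadm verify r0        # online verify, marks out-of-sync blocks
grep oos: /proc/drbd     # non-zero oos count after the verify
drbdadm disconnect r0    # disconnect ...
drbdadm connect r0       # ... and reconnect, which resyncs the oos blocks
drbdadm verify r0        # next verify: new oos blocks, different place
</pre>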
<br>
I changed MTU to 1500. No difference; still can't get a clean verify.<br>
<br>
I changed tcp_reordering to 3. No difference (no difference in <br>
performance, either).<br>
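<br>
Those two changes were simply:<br>
<pre>
ifconfig bond0 mtu 1500                  # back to the standard MTU
sysctl -w net.ipv4.tcp_reordering=3      # kernel default
</pre>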
<br>
Finally, I shut down half of the bonded pair on each system, so I'm
<br>
effectively using a single GbE link with MTU=9000 and tcp_reordering=127. Wow,
<br>
now everything is working fine; syncs are clean, verifies are clean, <br>
violins are playing.<br>
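<br>
For that last test I just detached one slave from the bond on each side<br>
(eth3 stands for the second port):<br>
<pre>
ifenslave -d bond0 eth3    # drop one slave, leaving a single GbE link
</pre>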
<br>
My question is: WTF? I'd really like to get the bonding pair working <br>
again, for redundancy and performance, but it very quickly falls apart
in <br>
this case. I'd appreciate any insight into this that anyone can give.<br>
<br>
Steve<br>
_______________________________________________<br>
drbd-user mailing list<br>
drbd-user@lists.linbit.com<br>
http://lists.linbit.com/mailman/listinfo/drbd-user<br>
</font></tt>
<br>