Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I am experiencing similar problems, though not so severe. A number of
systems running SLES 11 HA / SLES 11 HA SP1 on both IBM and Dell
hardware, with interlinks on Intel igb, Broadcom or e1000 NICs, all
show the same behaviour: from time to time the verification fails, the
device disconnects, reconnects, and is fine again. We are not using
jumbo frames; most systems are directly connected, some go over
switches, with bonding mode 4 or 0 depending on the system. However,
we are able to do a full sync without problems, and no out-of-sync
blocks are found when doing a verify. The only impact is on
performance, if the problem occurs more frequently. Reconnect and
resync are done within 1-2 seconds.

The quality of the interlink has a huge impact; however, even with new
NICs and new, short cables this still shows up, just less frequently.
I am not sure whether it is tied to bonding, because AFAIR I have also
seen it with a single link. The DRBD versions tested range from 8.3.4
to 8.3.8, on kernel 2.6.27 (SLES 11) and 2.6.32 (SLES 11 SP1) at the
newest available patch level, with DRBD also patched to the newest
levels available from Novell for the respective SLES releases. The
amount of trouble seems tied to the I/O load on the system. Most
notably, not all devices are affected at the same time: there can be
two DRBD devices under load, one having issues all the time, while
another running over the same path on the same systems under similar
load at the same time has none. Disabling traffic offloading does not
always help, but gives better stability on some systems. Firmware and
driver updates have not helped so far.

Best Regards

Robert Köppl
Systemadministration
KNAPP Systemintegration GmbH
Waltenbachstraße 9
8700 Leoben, Austria
www.KNAPP.com
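For readers seeing the same symptoms: the offload and bonding checks
Robert mentions look roughly like the sketch below. The interface
names eth2/eth3 and bond0 are placeholders, not taken from his setup;
adjust them to whatever your replication link uses.

    # Disable segmentation/receive offloads and checksum offloads on the
    # replication NICs (placeholder names; use your bond's slave interfaces):
    ethtool -K eth2 tso off gso off gro off tx off rx off
    ethtool -K eth3 tso off gso off gro off tx off rx off

    # Inspect the bonding mode (0 = balance-rr, 4 = 802.3ad) and slave state:
    cat /proc/net/bonding/bond0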
Steve Thompson <smt at vgersoft.com>
Sent by: drbd-user-bounces at lists.linbit.com
13.01.2011 22:28

To: drbd-user at lists.linbit.com
Cc:
Subject: [DRBD-user] Bonding [WAS repeated resync/fail/resync]

On Sat, 8 Jan 2011, Steve Thompson wrote:

> CentOS 5.5, x86_64, drbd 8.3.8, Dell PE2900 servers w/16GB memory.
> The replication link is a dual GbE bonded pair (point-to-point, no
> switches) in balance-rr mode with MTU=9000. Using tcp_reordering=127.

I reported that a resync failed and restarted every minute or so for a
couple of weeks. I have found the cause, but am not sure of the
solution. I'll try to keep it short.

First, I swapped cables, cards, systems, etc. in order to be sure of
the integrity of the hardware. All hardware checks out OK. Secondly, I
was using a data-integrity-alg of sha1 or crc32c (tried both). Only
when this was removed from the configuration was I able to get a full
resync to complete. There is an ext3 file system on the drbd volume,
but it is quiet; this is a non-production test system.

After this, a verify pass showed several out-of-sync blocks. I
disconnect and reconnect and re-run the verify pass. Now there are
more out-of-sync blocks, but in a different place. Rinse and repeat;
verify was never clean, and the out-of-sync blocks were never in the
same place twice. I changed MTU to 1500. No difference; still no clean
verify. I changed tcp_reordering to 3. No difference (no difference in
performance, either).

Finally, I shut down half of the bonded pair on each system, so I am
effectively using a single GbE link with MTU=9000 and
tcp_reordering=127. Wow, now everything is working fine; syncs are
clean, verifies are clean, violins are playing.

My question is: WTF? I'd really like to get the bonded pair working
again, for redundancy and performance, but it very quickly falls apart
in this case. I'd appreciate any insight into this that anyone can
give.

Steve
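For context, the settings Steve describes correspond roughly to the
sketch below on a DRBD 8.3 node. The resource name r0 is a placeholder
and the values are only illustrative, not his actual configuration.

    # drbd.conf net section (illustrative). data-integrity-alg adds a
    # per-packet checksum on the replication link; Steve had to remove it
    # to complete a full resync. verify-alg is what "drbdadm verify" uses.
    #   net {
    #     data-integrity-alg crc32c;
    #     verify-alg         crc32c;
    #   }

    # On-line verify and the disconnect/reconnect cycle described above;
    # blocks flagged out-of-sync by the verify are resynced after the
    # reconnect:
    drbdadm verify r0
    drbdadm disconnect r0
    drbdadm connect r0

    # TCP reordering tolerance raised for the balance-rr bond, which sends
    # packets round-robin across the slaves and can therefore reorder TCP
    # segments:
    sysctl -w net.ipv4.tcp_reordering=127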