Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I am experiencing similar problems, though not so severe. A number of
systems running SLES 11 HA / SLES 11 HA SP1 on both IBM and Dell
hardware, with interlinks on Intel igb, Broadcom or e1000 NICs, all
show the same behaviour: from time to time the verification fails, the
device disconnects, reconnects, and is fine again. We are not using
jumbo frames; most systems are directly connected, some go over
switches, with bonding mode 4 or 0 depending on the system. However,
we are able to do a full sync without problems, and no out-of-sync
blocks are found when doing a verify. The only impact is on
performance, if the problem occurs more frequently. Reconnect and
resync are done within 1-2 seconds.

The quality of the interlink has a huge impact; however, even with new
NICs and new, short cables this still shows up, just less frequently.
I am not sure whether it is tied to bonding, because AFAIR I have also
seen it with a single link. The DRBD versions tested range from 8.3.4
to 8.3.8, on kernel 2.6.27 (SLES 11) and 2.6.32 (SLES 11 SP1) at the
newest available patch level, with DRBD also patched to the newest
levels available from Novell for the respective SLES releases. The
amount of trouble seems tied to the I/O load on the system. Most
notably, not all devices are affected at the same time: there can be
two DRBD devices under load, one having issues all the time, while
another running over the same path on the same systems under similar
load at the same time has none. Disabling traffic offloading does not
always help, but gives better stability on some systems. Firmware and
driver updates have not helped so far.

Best Regards

Robert Köppl
Systemadministration
KNAPP Systemintegration GmbH
Waltenbachstraße 9
8700 Leoben, Austria
www.KNAPP.com
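For readers seeing the same symptoms: the offload and bonding checks
Robert mentions look roughly like the sketch below. The interface
names eth2/eth3 and bond0 are placeholders, not taken from his setup;
adjust them to whatever your replication link uses.

    # Disable segmentation/receive offloads and checksum offloads on the
    # replication NICs (placeholder names; use your bond's slave interfaces):
    ethtool -K eth2 tso off gso off gro off tx off rx off
    ethtool -K eth3 tso off gso off gro off tx off rx off

    # Inspect the bonding mode (0 = balance-rr, 4 = 802.3ad) and slave state:
    cat /proc/net/bonding/bond0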
Steve Thompson <smt at vgersoft.com>
Sent by: drbd-user-bounces at lists.linbit.com
13.01.2011 22:28

To: drbd-user at lists.linbit.com
Cc:
Subject: [DRBD-user] Bonding [WAS repeated resync/fail/resync]

On Sat, 8 Jan 2011, Steve Thompson wrote:

> CentOS 5.5, x86_64, drbd 8.3.8, Dell PE2900 servers w/16GB memory.
> The replication link is a dual GbE bonded pair (point-to-point, no
> switches) in balance-rr mode with MTU=9000. Using tcp_reordering=127.

I reported that a resync failed and restarted every minute or so for a
couple of weeks. I have found the cause, but am not sure of the
solution. I'll try to keep it short.

First, I swapped cables, cards, systems, etc. in order to be sure of
the integrity of the hardware. All hardware checks out OK. Secondly, I
was using a data-integrity-alg of sha1 or crc32c (tried both). Only
when this was removed from the configuration was I able to get a full
resync to complete. There is an ext3 file system on the drbd volume,
but it is quiet; this is a non-production test system.

After this, a verify pass showed several out-of-sync blocks. I
disconnect and reconnect and re-run the verify pass. Now there are
more out-of-sync blocks, but in a different place. Rinse and repeat;
verify was never clean, and the out-of-sync blocks were never in the
same place twice. I changed MTU to 1500. No difference; still no clean
verify. I changed tcp_reordering to 3. No difference (no difference in
performance, either).

Finally, I shut down half of the bonded pair on each system, so I am
effectively using a single GbE link with MTU=9000 and
tcp_reordering=127. Wow, now everything is working fine; syncs are
clean, verifies are clean, violins are playing.

My question is: WTF? I'd really like to get the bonded pair working
again, for redundancy and performance, but it very quickly falls apart
in this case. I'd appreciate any insight into this that anyone can
give.

Steve
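For context, the settings Steve describes correspond roughly to the
sketch below on a DRBD 8.3 node. The resource name r0 is a placeholder
and the values are only illustrative, not his actual configuration.

    # drbd.conf net section (illustrative). data-integrity-alg adds a
    # per-packet checksum on the replication link; Steve had to remove it
    # to complete a full resync. verify-alg is what "drbdadm verify" uses.
    #   net {
    #     data-integrity-alg crc32c;
    #     verify-alg         crc32c;
    #   }

    # On-line verify and the disconnect/reconnect cycle described above;
    # blocks flagged out-of-sync by the verify are resynced after the
    # reconnect:
    drbdadm verify r0
    drbdadm disconnect r0
    drbdadm connect r0

    # TCP reordering tolerance raised for the balance-rr bond, which sends
    # packets round-robin across the slaves and can therefore reorder TCP
    # segments:
    sysctl -w net.ipv4.tcp_reordering=127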