On Sat, 8 Jan 2011, Steve Thompson wrote:

CentOS 5.5, x86_64, drbd 8.3.8, Dell PE2900 servers w/16GB memory. The replication link is a dual GbE bonded pair (point-to-point, no switches) in balance-rr mode with MTU=9000, using tcp_reordering=127.

I reported that a resync failed and restarted every minute or so for a couple of weeks. I have found the cause, but am not sure of the solution. I'll try to keep it short.

First, I swapped cables, cards, systems, etc. to be sure of the integrity of the hardware; all hardware checks out OK. Second, I was using a data-integrity-alg of sha1 or crc32c (I tried both). Only when this was removed from the configuration was I able to get a full resync to complete. There is an ext3 file system on the drbd volume, but it is quiet; this is a non-production test system.

After this, a verify pass showed several out-of-sync blocks. I disconnected and reconnected and re-ran the verify pass: more out-of-sync blocks, but in a different place. Rinse and repeat; verify was never clean, and the out-of-sync blocks were never in the same place twice.

I changed MTU to 1500: no difference; still no clean verify. I changed tcp_reordering to 3: no difference (and no difference in performance, either).

Finally, I shut down half of the bonded pair on each system, so I am effectively running a single GbE link with MTU=9000 and tcp_reordering=127. Now everything is working fine: syncs are clean, verifies are clean, violins are playing.

My question is: WTF? I'd really like to get the bonded pair working again, for redundancy and performance, but things very quickly fall apart in that configuration. I'd appreciate any insight that anyone can give.

Steve
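[For readers unfamiliar with the setting in question: data-integrity-alg lives in the net section of drbd.conf and makes DRBD checksum every replicated block on the wire. A minimal sketch of the kind of configuration described above, in DRBD 8.3 syntax; the resource name, host names, devices, and addresses are hypothetical, not taken from the original setup:]

```
# Hypothetical drbd.conf fragment (DRBD 8.3 syntax). All names and
# addresses below are illustrative.
resource r0 {
  protocol C;
  net {
    # Checksums each data block in transit; removing this line is what
    # allowed the full resync to complete in the report above.
    data-integrity-alg sha1;    # crc32c was also tried
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.100.1:7788;   # the bonded point-to-point link
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.100.2:7788;
    meta-disk internal;
  }
}
```

[With data-integrity-alg enabled, a checksum mismatch on a received block causes DRBD to drop and re-establish the connection, which is consistent with the resync restarting repeatedly when the bonded link was delivering data DRBD considered corrupt.]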