On Sat, 8 Jan 2011, Steve Thompson wrote:

CentOS 5.5, x86_64, drbd 8.3.8, Dell PE2900 servers w/16GB memory. The replication link is a dual GbE bonded pair (point-to-point, no switches) in balance-rr mode with MTU=9000, using tcp_reordering=127.

I reported that a resync failed and restarted every minute or so for a couple of weeks. I have found the cause, but am not sure of the solution. I'll try to keep it short.

First, I swapped cables, cards, systems, etc. to be sure of the integrity of the hardware; all hardware checks out OK. Second, I was using a data-integrity-alg of sha1 or crc32c (I tried both). Only when this was removed from the configuration was I able to get a full resync to complete. There is an ext3 file system on the drbd volume, but it is quiet; this is a non-production test system.

After this, a verify pass showed several out-of-sync blocks. I disconnected and reconnected and re-ran the verify pass: more out-of-sync blocks, but in a different place. Rinse and repeat; verify was never clean, and the out-of-sync blocks were never in the same place twice.

I changed MTU to 1500: no difference; still no clean verify. I changed tcp_reordering to 3: no difference (and no difference in performance, either).

Finally, I shut down half of the bonded pair on each system, so I am effectively running a single GbE link with MTU=9000 and tcp_reordering=127. Now everything is working fine: syncs are clean, verifies are clean, violins are playing.

My question is: WTF? I'd really like to get the bonded pair working again, for redundancy and performance, but things very quickly fall apart in that configuration. I'd appreciate any insight that anyone can give.

Steve
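[For readers unfamiliar with the setting in question: data-integrity-alg lives in the net section of drbd.conf and makes DRBD checksum every replicated block on the wire. A minimal sketch of the kind of configuration described above, in DRBD 8.3 syntax; the resource name, host names, devices, and addresses are hypothetical, not taken from the original setup:]

```
# Hypothetical drbd.conf fragment (DRBD 8.3 syntax). All names and
# addresses below are illustrative.
resource r0 {
  protocol C;
  net {
    # Checksums each data block in transit; removing this line is what
    # allowed the full resync to complete in the report above.
    data-integrity-alg sha1;    # crc32c was also tried
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.100.1:7788;   # the bonded point-to-point link
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.100.2:7788;
    meta-disk internal;
  }
}
```

[With data-integrity-alg enabled, a checksum mismatch on a received block causes DRBD to drop and re-establish the connection, which is consistent with the resync restarting repeatedly when the bonded link was delivering data DRBD considered corrupt.]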