Mike Lovell wrote:
> Johan Verrept wrote:
>> On Wed, 2009-10-14 at 23:21 -0600, Mike Lovell wrote:
>>
>>> first off, hello everybody. i'm somewhat new to drbd and definitely new
>>> to the mailing list.
>>>
>>> i am trying to set up a cheap alternative to an iscsi san using some
>>> somewhat commodity hardware and drbd. i happen to have some 10 gigabit
>>> network interfaces around so i thought it would be a great interconnect
>>> for the drbd replication and probably as the interconnect to the rest of
>>> the network.
>>>
>>> things were going well in my small proof of concept but when i made the
>>> jump to the 10 gigabit network interfaces, i started running into
>>> troubles with drbd not being able to complete a synchronization. it will
>>> get anywhere between 5 and 15 percent done (on a 2TB volume) and then
>>> stall. the only thing i have been able to do to get things going again
>>> is to take down the network interface, stop drbd, bring back up the
>>> interface, start drbd, and wait for it to stall again. i have to take
>>> down the network interface because drbd won't respond until then.
>>>
>>> in dmesg on the node with the UpToDate disk, i see errors like this in
>>> the kernel log.
>>>
>>> [191401.876167] drbd0: Began resync as SyncSource (will sync 1809012776 KB [452253194 bits set]).
>>> [191409.068152] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967295
>>> [191416.533556] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967294
>>> [191423.531804] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967293
>>> [191429.888326] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967292
>>> [191437.658299] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967291
>>>
>>> in my troubleshooting, i tried changing the replication to use the
>>> gigabit network interfaces already in the system and the synchronization
>>> completed. i also tried a newer kernel and a new version of drbd.
>>>
>>> i am doing this on debian lenny using the 2.6.26 kernel and drbd 8.0.14
>>> that ship with the distro. the system is a single opteron 2346 on a
>>> supermicro h8dme-2 with an intel 10 gigabit nic. the underlying device is
>>> a software raid10 with linux md. i did try a 2.6.30 kernel and drbd 8.3
>>> but it didn't help.
>>>
>>> has anyone seen anything like this or have any recommendations?
>>>
>>
>> <disclaimer> I am not an expert at drbd </disclaimer>
>>
>> I have seen similar things (stalling drbd) mentioned on the mailing
>> list. Mostly the reaction is a finger pointing first to your network
>> interface/drivers. Perhaps you should look into that first? From your
>> symptoms, I would strongly suspect the problem is there (especially
>> since it works fine once you switch interfaces). Perhaps run a few iperf
>> tests to see if it runs smoothly?
>>
>> J.
>>
>>
> i realized right after i sent my request that i hadn't done any load
> or integrity testing on the 10 gigabit interfaces since i moved them
> around and reinstalled the OS. i had previously used these nics for
> stuff other than drbd and so i assumed that things were still
> operating properly. i am going to start some testing on the interfaces
> and see if i see any problems but considering my previous experience
> with these cards, i'm doubting that is the problem. no harm in
> checking though. i'll let the list know the results of my test.
>
> has anyone else on the list been able to do drbd over 10 gigabit links
> before and been successful with it? if so, what was your hardware and
> software set up to do it?

i did some performance and load testing on the 10 gig interfaces today.
using a variety of methods, i moved more than 10 TiB of data across the
link without dropped packets or connection interruptions. i did things
like `cat /dev/zero | nc` on one box to `nc > /dev/null` on the other,
and ran iperf and NPtcp between the nodes. no kernel errors, no
connection drops, no dropped packets listed in ifconfig for the devices.
i even just tried building the latest drivers for the nic from intel and
the problem remains.

any other thoughts?

mike
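
For anyone reading the quoted kernel log, here is a minimal Python sketch that pulls the numbers out of it. The log text is copied from the message above. The helper names (`parse`, `as_signed32`) are made up for illustration, and treating `ko` as a 32-bit counter is an assumption, not something confirmed in this thread.

```python
import re

# Kernel log lines as quoted in the thread (wrapped lines rejoined).
LOG = """\
[191401.876167] drbd0: Began resync as SyncSource (will sync 1809012776 KB [452253194 bits set]).
[191409.068152] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967295
[191416.533556] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967294
[191423.531804] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967293
[191429.888326] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967292
[191437.658299] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967291
"""


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as signed.

    Assumption: the printed ko value is a 32-bit counter.
    """
    return value - (1 << 32) if value >= (1 << 31) else value


def parse(log):
    """Return (sync size in KB, bits set, [(timestamp, ko), ...])."""
    sync = re.search(r"will sync (\d+) KB \[(\d+) bits set\]", log)
    kb, bits = int(sync.group(1)), int(sync.group(2))
    events = [
        (float(ts), int(ko))
        for ts, ko in re.findall(
            r"\[\s*(\d+\.\d+)\].*time expired, ko = (\d+)", log
        )
    ]
    return kb, bits, events


kb, bits, events = parse(LOG)

# How much data one set bit stands for in the resync message.
kb_per_bit = kb // bits

# ko values reinterpreted as signed 32-bit numbers.
signed_ko = [as_signed32(ko) for _, ko in events]

# Seconds between consecutive "time expired" messages.
gaps = [round(b[0] - a[0], 2) for a, b in zip(events, events[1:])]

print("KB per set bit:", kb_per_bit)
print("ko as signed 32-bit:", signed_ko)
print("seconds between timeouts:", gaps)
```

With the numbers from this log, each set bit works out to 4 KB, the timeouts recur roughly every 6 to 8 seconds, and the ko values read as -1 through -5 when treated as signed 32-bit. That would be consistent with a counter that started at 0 and kept counting down past zero, but whether drbd 8.0.14 really handles the value that way is not established here.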