Mike Lovell wrote:
> Johan Verrept wrote:
>> On Wed, 2009-10-14 at 23:21 -0600, Mike Lovell wrote:
>>> first off, hello everybody. i'm somewhat new to drbd and definitely new 
>>> to the mailing list.
>>> i am trying to set up a cheap alternative to an iscsi san using some 
>>> somewhat commodity hardware and drbd. i happen to have some 10 gigabit 
>>> network interfaces around so i thought it would be a great interconnect 
>>> for the drbd replication and probably as the interconnect to the rest of 
>>> the network.
>>> things were going well in my small proof of concept but when i made the 
>>> jump to the 10 gigabit network interfaces, i started running into 
>>> troubles with drbd not being able to complete a synchronization. it will 
>>> get anywhere between 5 and 15 percent done (on a 2TB volume) and then 
>>> stall. the only thing i have been able to do to get things going again 
>>> is to take down the network interface, stop drbd, bring back up the 
>>> interface, start drbd, and wait for it to stall again. i have to take 
>>> down the network interface because drbd won't respond until then.
>>> in dmesg on the node with the UpToDate disk, i see errors like this in 
>>> the kernel log.
>>> [191401.876167] drbd0: Began resync as SyncSource (will sync 1809012776 
>>> KB [452253194 bits set]).
>>> [191409.068152] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, 
>>> ko = 4294967295
>>> [191416.533556] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, 
>>> ko = 4294967294
>>> [191423.531804] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, 
>>> ko = 4294967293
>>> [191429.888326] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, 
>>> ko = 4294967292
>>> [191437.658299] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, 
>>> ko = 4294967291
>>> in my troubleshooting, i tried changing the replication to use the 
>>> gigabit network interfaces already in the system and the synchronization 
>>> completed. i also tried a newer kernel and a new version of drbd.
>>> i am doing this on debian lenny using the 2.6.26 kernel and drbd 8.0.14 
>>> that are with the distro. the system is a single opteron 2346 on a 
>>> supermicro h8dme-2 with an intel 10 gigabit nic. the underlying device is 
>>> a software raid10 with linux md. i did try a 2.6.30 kernel and drbd 8.3 
>>> but it didn't help.
>>> has anyone seen anything like this or have any recommendations?
>> <disclaimer> I am not an expert at drbd </disclaimer>
>> I have seen similar things (stalling drbd) mentioned on the mailing
>> list. Mostly the reaction is a finger pointing first to your network
>> interface/drivers. Perhaps you should look into that first? From your
>> symptoms, I would strongly suspect the problem is there (especially
>> since it works fine once you switch interfaces). Perhaps run a few iperf
>> tests to see if it runs smoothly?
>> 	J.
> i realized right after i sent my request that i hadn't done any load 
> or integrity testing on the 10 gigabit interfaces since i moved them 
> around and reinstalled the OS. i had previously used these nics for 
> stuff other than drbd and so i assumed that things were still 
> operating properly. i am going to start some testing on the interfaces 
> and see if i see any problems but considering my previous experience 
> with these cards, i'm doubting that is the problem. no harm in 
> checking though. i'll let the list know the results of my test.
> has anyone else on the list been able to do drbd over 10 gigabit links 
> before and been successful with it? if so, what was your hardware and 
> software set up to do it?

i did some performance and load testing on the 10 gig interfaces today. 
using a variety of methods, i moved > 10 TiB of data across the link 
without dropped packets or connection interruptions. i did things like `cat 
/dev/zero | nc` on one box to `nc > /dev/null` on the other, and ran iperf 
and NPtcp between the nodes. no kernel errors, no connection drops, no 
dropped packets listed in ifconfig for the devices. i even tried 
building the latest drivers for the nic from intel, but the problem remains.
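
for anyone who wants to repeat the dropped-packet check without eyeballing ifconfig, here is a minimal sketch in python. it is not what was run for the tests above. it assumes the usual linux /proc/net/dev layout (two header lines, then 16 counters per interface), and the function names and the interface name `eth2` are made up for illustration.

```python
from typing import Dict

Counters = Dict[str, Dict[str, int]]


def parse_net_dev(text: str) -> Counters:
    """Parse /proc/net/dev-style text into per-interface error/drop counters.

    Assumes the standard layout: 8 receive counters followed by
    8 transmit counters after the "name:" prefix.
    """
    counters: Counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        counters[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return counters


def counter_deltas(before: Counters, after: Counters) -> Counters:
    """Return only the counters that increased between two snapshots."""
    deltas: Counters = {}
    for name, now in after.items():
        then = before.get(name, {})
        grew = {k: v - then.get(k, 0) for k, v in now.items() if v > then.get(k, 0)}
        if grew:
            deltas[name] = grew
    return deltas
```

to use it, read /proc/net/dev into a string before and after a load test, pass both through `parse_net_dev`, and hand the results to `counter_deltas`. an empty result means no error or drop counter moved during the run.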

any other thoughts?


