[DRBD-user] drbd + 10gig network

Fri Oct 16 13:14:19 CEST 2009

On Fri, Oct 16, 2009 at 02:21:40AM -0600, Mike Lovell wrote:
> Mike Lovell wrote:
>> Johan Verrept wrote:
>>> On Wed, 2009-10-14 at 23:21 -0600, Mike Lovell wrote:
>>>   
>>>> first off, hello everybody. i'm somewhat new to drbd and definitely 
>>>> new to the mailing list.
>>>>
>>>> i am trying to set up a cheap alternative to an iscsi san using some  
>>>> somewhat commodity hardware and drbd. i happen to have some 10 
>>>> gigabit network interfaces around so i thought it would be a great 
>>>> interconnect for the drbd replication and probably as the 
>>>> interconnect to the rest of the network.
>>>>
>>>> things were going well in my small proof of concept but when i made 
>>>> the jump to the 10 gigabit network interfaces, i started running 
>>>> into troubles with drbd not being able to complete a 
>>>> synchronization. it will get anywhere between 5 and 15 percent done 
>>>> (on a 2TB volume) and then stall. the only thing i have been able to 
>>>> do to get things going again is to take down the network interface, 
>>>> stop drbd, bring back up the interface, start drbd, and wait for it 
>>>> to stall again. i have to take down the network interface because 
>>>> drbd won't respond until then.
>>>>
>>>> in dmesg on the node with the UpToDate disk, i see errors like this 
>>>> in the kernel log.
>>>>
>>>> [191401.876167] drbd0: Began resync as SyncSource (will sync 1809012776 KB [452253194 bits set]).
>>>> [191409.068152] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967295
>>>> [191416.533556] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967294
>>>> [191423.531804] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967293
>>>> [191429.888326] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967292
>>>> [191437.658299] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967291
>>>>
>>>> in my troubleshooting, i tried changing the replication to use the 
>>>> gigabit network interfaces already in the system and the 
>>>> synchronization completed. i also tried a newer kernel and a new 
>>>> version of drbd.
>>>>
>>>> i am doing this on debian lenny using the 2.6.26 kernel and drbd 
>>>> 8.0.14 that are with the distro. the system is a single opteron 
>>>> 2346 on a supermicro h8dme-2 with an intel 10 gigabit nic. the 
>>>> underlying device is a software raid10 with linux md. i did try a 
>>>> 2.6.30 kernel and drbd 8.3 but it didn't help.
>>>>
>>>> has anyone seen anything like this or have any recommendations?
>>>>     
>>>
>>> <disclaimer> I am not an expert at drbd </disclaimer>
>>>
>>> I have seen similar things (stalling drbd) mentioned on the mailing
>>> list. Mostly the reaction is a finger pointing first to your network
>>> interface/drivers. Perhaps you should look into that first? From your
>>> symptoms, I would strongly suspect the problem is there (especially
>>> since it works fine once you switch interfaces). Perhaps run a few iperf
>>> tests to see if it runs smoothly?
>>>
>>> 	J.
>>>
>>>   
>> i realized right after i sent my request that i hadn't done any load  
>> or integrity testing on the 10 gigabit interfaces since i moved them  
>> around and reinstalled the OS. i had previously used these nics for  
>> stuff other than drbd and so i assumed that things were still  
>> operating properly. i am going to start some testing on the interfaces  
>> and see if i see any problems but considering my previous experience  
>> with these cards, i'm doubting that is the problem. no harm in  
>> checking though. i'll let the list know the results of my test.
>>
>> has anyone else on the list been able to do drbd over 10 gigabit links  
>> before and been successful with it? if so, what was your hardware and  
>> software set up to do it?
>
> i did some performance and load testing on the 10 gig interfaces today.  
> using a variety of methods, i moved > 10 TiB of data across the link  
> without dropped packets or connection interrupts. i did things like `cat  
> /dev/zero | nc` on one box to `nc > /dev/null` on the other and iperf  
> and NPtcp between the nodes. no kernel errors, no connection drops, no  
> dropped packets listed in ifconfig for the devices. i even just tried  
> building the latest drivers for the nic from intel and the problem 
> remains.
>
> any other thoughts?

try DRBD 8.3.4.
It handles some settings more gracefully.

On <= 8.3.2, try decreasing sync-rate and increasing "max-buffers".
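
For readers following along, the two settings named above could look
roughly like this in a DRBD 8.3-style drbd.conf. This is a sketch under
assumptions, not a tuned configuration: the resource name r0 and both
values are placeholders that do not come from this thread, and
"sync-rate" is taken here to mean the "rate" option of the syncer
section.

```
# illustrative fragment only: the "on <host>" sections with the disk,
# device and address settings of the real resource are omitted
resource r0 {
  # r0 is a hypothetical resource name
  syncer {
    # resync throttle; placeholder value, to be set below what the
    # link and the backing storage can sustain
    rate 30M;
  }
  net {
    # placeholder value; the right number depends on the setup
    max-buffers 8000;
  }
}
```

If the file is changed on both nodes, something like `drbdadm adjust r0`
should apply the new settings to the running resource.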

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed