[DRBD-user] drbd + 10gig network
Lars Ellenberg
lars.ellenberg at linbit.com
Fri Oct 16 13:14:19 CEST 2009
On Fri, Oct 16, 2009 at 02:21:40AM -0600, Mike Lovell wrote:
> Mike Lovell wrote:
>> Johan Verrept wrote:
>>> On Wed, 2009-10-14 at 23:21 -0600, Mike Lovell wrote:
>>>
>>>> first off, hello everybody. i'm somewhat new to drbd and definitely
>>>> new to the mailing list.
>>>>
>>>> i am trying to set up a cheap alternative to an iscsi san using some
>>>> somewhat commodity hardware and drbd. i happen to have some 10
>>>> gigabit network interfaces around so i thought it would be a great
>>>> interconnect for the drbd replication and probably as the
>>>> interconnect to the rest of the network.
>>>>
>>>> things were going well in my small proof of concept but when i made
>>>> the jump to the 10 gigabit network interfaces, i started running
>>>> into troubles with drbd not being able to complete a
>>>> synchronization. it will get anywhere between 5 and 15 percent done
>>>> (on a 2TB volume) and then stall. the only thing i have been able to
>>>> do to get things going again is to take down the network interface,
>>>> stop drbd, bring back up the interface, start drbd, and wait for it
>>>> to stall again. i have to take down the network interface because
>>>> drbd won't respond until then.
>>>>
>>>> in dmesg on the node with the UpToDate disk, i see errors like this
>>>> in the kernel log.
>>>>
>>>> [191401.876167] drbd0: Began resync as SyncSource (will sync
>>>> 1809012776 KB [452253194 bits set]).
>>>> [191409.068152] drbd0: [drbd0_worker/24334] sock_sendmsg time
>>>> expired, ko = 4294967295
>>>> [191416.533556] drbd0: [drbd0_worker/24334] sock_sendmsg time
>>>> expired, ko = 4294967294
>>>> [191423.531804] drbd0: [drbd0_worker/24334] sock_sendmsg time
>>>> expired, ko = 4294967293
>>>> [191429.888326] drbd0: [drbd0_worker/24334] sock_sendmsg time
>>>> expired, ko = 4294967292
>>>> [191437.658299] drbd0: [drbd0_worker/24334] sock_sendmsg time
>>>> expired, ko = 4294967291
>>>>
>>>> in my troubleshooting, i tried changing the replication to use the
>>>> gigabit network interfaces already in the system and the
>>>> synchronization completed. i also tried a newer kernel and a new
>>>> version of drbd.
>>>>
>>>> i am doing this on debian lenny using the 2.6.26 kernel and drbd
>>>> 8.0.14 that are with the distro. the system is a single opteron
>>>> 2346 on a supermicro h8dme-2 with an intel 10 gigabit nic. the
>>>> underlying device is a software raid10 with linux md. i did try a
>>>> 2.6.30 kernel and drbd 8.3 but it didn't help.
>>>>
>>>> has anyone seen anything like this or have any recommendations?
>>>>
>>>
>>> <disclaimer> I am not an expert at drbd </disclaimer>
>>>
>>> I have seen similar things (stalling drbd) mentioned on the mailing
>>> list. Mostly the reaction is a finger pointing first to your network
>>> interface/drivers. Perhaps you should look into that first? From your
>>> symptoms, I would strongly suspect the problem is there (especially
>>> since it works fine once you switch interfaces). Perhaps run a few iperf
>>> tests to see if it runs smoothly?
>>>
>>> J.
>>>
>>>
>> i realized right after i sent my request that i hadn't done any load
>> or integrity testing on the 10 gigabit interfaces since i moved them
>> around and reinstalled the OS. i had previously used these nics for
>> stuff other than drbd and so i assumed that things were still
>> operating properly. i am going to start some testing on the interfaces
>> and see if i see any problems but considering my previous experience
>> with these cards, i'm doubting that is the problem. no harm in
>> checking though. i'll let the list know the results of my test.
>>
>> has anyone else on the list been able to do drbd over 10 gigabit links
>> before and been successful with it? if so, what was your hardware and
>> software set up to do it?
>
> i did some performance and load testing on the 10 gig interfaces today.
> using a variety of methods, i moved > 10 TiB of data across the link
> without dropped packets or connection interrupts. i used things like `cat
> /dev/zero | nc` on one box to `nc > /dev/null` on the other and iperf
> and NPtcp between the nodes. no kernel errors, no connection drops, no
> dropped packets listed in ifconfig for the devices. i even just tried
> building the latest drivers for the nic from intel and the problem
> remains.
>
> any other thoughts?

Try DRBD 8.3.4.
It handles some settings more gracefully.
On <= 8.3.2, try decreasing sync-rate and increasing "max-buffers".
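
For illustration, a minimal sketch of where those two settings live in
drbd.conf, assuming the 8.x-style "syncer" and "net" sections. The
resource name r0 and both values are placeholders chosen for the
example, not tuned recommendations; the right numbers depend on the
hardware.

```
resource r0 {                # hypothetical resource name
  syncer {
    rate 100M;               # assumed value: cap resync bandwidth below link speed
  }
  net {
    max-buffers 8000;        # assumed value: more receive buffers than the default
  }
}
```

After editing the file on both nodes, "drbdadm adjust r0" should apply
the changed settings to the running resource.
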
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed