On Fri, Oct 16, 2009 at 02:21:40AM -0600, Mike Lovell wrote:
> Mike Lovell wrote:
>> Johan Verrept wrote:
>>> On Wed, 2009-10-14 at 23:21 -0600, Mike Lovell wrote:
>>>
>>>> first off, hello everybody. i'm somewhat new to drbd and definitely
>>>> new to the mailing list.
>>>>
>>>> i am trying to set up a cheap alternative to an iscsi san using some
>>>> somewhat commodity hardware and drbd. i happen to have some 10
>>>> gigabit network interfaces around so i thought it would be a great
>>>> interconnect for the drbd replication and probably as the
>>>> interconnect to the rest of the network.
>>>>
>>>> things were going well in my small proof of concept but when i made
>>>> the jump to the 10 gigabit network interfaces, i started running
>>>> into troubles with drbd not being able to complete a
>>>> synchronization. it will get anywhere between 5 and 15 percent done
>>>> (on a 2TB volume) and then stall. the only thing i have been able to
>>>> do to get things going again is to take down the network interface,
>>>> stop drbd, bring back up the interface, start drbd, and wait for it
>>>> to stall again. i have to take down the network interface because
>>>> drbd won't respond until then.
>>>>
>>>> in dmesg on the node with the UpToDate disk, i see errors like this
>>>> in the kernel log.
>>>>
>>>> [191401.876167] drbd0: Began resync as SyncSource (will sync 1809012776 KB [452253194 bits set]).
>>>> [191409.068152] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967295
>>>> [191416.533556] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967294
>>>> [191423.531804] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967293
>>>> [191429.888326] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967292
>>>> [191437.658299] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967291
>>>>
>>>> in my troubleshooting, i tried changing the replication to use the
>>>> gigabit network interfaces already in the system and the
>>>> synchronization completed. i also tried a newer kernel and a new
>>>> version of drbd.
>>>>
>>>> i am doing this on debian lenny using the 2.6.26 kernel and drbd
>>>> 8.0.14 that are with the distro. the system is a single opteron
>>>> 2346 on a supermicro h8dme-2 with an intel 10 gigabit nic. the
>>>> underlying device is a software raid10 with linux md. i did try a
>>>> 2.6.30 kernel and drbd 8.3 but it didn't help.
>>>>
>>>> has anyone seen anything like this or have any recommendations?
>>>>
>>>
>>> <disclaimer> I am not an expert at drbd </disclaimer>
>>>
>>> I have seen similar things (stalling drbd) mentioned on the mailing
>>> list. Mostly the reaction is a finger pointing first to your network
>>> interface/drivers. Perhaps you should look into that first? From your
>>> symptoms, I would strongly suspect the problem is there (especially
>>> since it works fine once you switch interfaces). Perhaps run a few iperf
>>> tests to see if it runs smoothly?
>>>
>>> J.
>>>
>>>
>> i realized right after i sent my request that i hadn't done any load
>> or integrity testing on the 10 gigabit interfaces since i moved them
>> around and reinstalled the OS. i had previously used these nics for
>> stuff other than drbd and so i assumed that things were still
>> operating properly. i am going to start some testing on the interfaces
>> and see if i see any problems but considering my previous experience
>> with these cards, i'm doubting that is the problem. no harm in
>> checking though. i'll let the list know the results of my test.
>>
>> has anyone else on the list been able to do drbd over 10 gigabit links
>> before and been successful with it? if so, what was your hardware and
>> software set up to do it?
>
> i did some performance and load testing on the 10 gig interfaces today.
> using a variety of methods, i moved > 10 TiB of data across the link
> without dropped packets or connection interrupt. i did things like `cat
> /dev/zero | nc` on one box to `nc > /dev/null` on the other and iperf
> and NPtcp between the nodes. no kernel errors, no connection drops, no
> dropped packets listed in ifconfig for the devices. i even just tried
> building the latest drivers for the nic from intel and the problem
> remains.
>
> any other thoughts?

Try DRBD 8.3.4. It handles some settings more gracefully.
On <= 8.3.2, try decreasing sync-rate and increasing "max-buffers".

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
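
For anyone reading this in the archive: a minimal sketch of where the two settings mentioned in the reply ("sync-rate" and "max-buffers") would go in a DRBD 8.3-style drbd.conf. The resource name r0 and both values are assumptions for illustration only; they are not from this thread and are not tuned recommendations for this hardware.

```
# hypothetical resource name; the rest of the resource definition is omitted
resource r0 {
  syncer {
    # resync rate cap; illustrative value, lower it if resync stalls
    rate 100M;
  }
  net {
    # receive-side buffer count; illustrative value, raised above the stock setting
    max-buffers 8000;
  }
}
```

After editing the file on both nodes, `drbdadm adjust r0` should apply the change to the running resource (again assuming the resource is called r0).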