[DRBD-user] 15 min timeout to drop TCP conn on MTU change (8.2.5)

Fri Dec 5 17:48:07 CET 2008

On Tue, Dec 02, 2008 at 04:49:18PM +0000, NM wrote:
> (I apologize for the dupe, I just posted this on the devel list but I 
> intended to post this here)
> 
> 
> A weird problem hit me today: I changed the MTU on the standby node in an 
> active/passive PG cluster based on drbd; this caused a freeze of exactly 
> 15 min on the drbd device, during which all postgres threads couldn't 
> commit.
> 
> Any idea why the timeout was so long? 

nope.

> Note that the two nodes are in separate locations, linked by a (currently 
> mostly idle) 100Mbps bridge. 
> 
> 
> This is the last message from postgres in /var/log/messages:
> 
> Dec  2 10:20:18 alice postgres[28729]: [38440-1] 2008-12-02 10:20:18 GMT radiusdb 192.168.0.51 28729 48fff5bc.7039 SELECTLOG:  duration: 0.183 ms
> 
> Nothing happens for 15 mins, until this shows up:
> 
> Dec  2 10:35:12 alice kernel: drbd1: sock_recvmsg returned -110

-110 is -ETIMEDOUT.

Hm.
That should have timed out within 6 seconds, as that is our default timeout.
Some strangeness in the TCP stack, I guess.
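
For reference, a sketch of the relevant net settings, assuming the defaults
documented in drbd.conf(5) for the 8.x series. Your drbd.conf does not set
them, so these values should already be in effect; writing them out only
makes them explicit:

```
net {
	timeout   60;	# unit is 0.1 seconds, so 60 = 6 seconds (documented default)
	ping-int  10;	# seconds between keep-alive pings (documented default)
}
```

With these, a peer that stops answering should normally be declared dead
within seconds, not minutes.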

> Dec  2 10:35:12 alice kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
> Dec  2 10:35:12 alice kernel: drbd1: short read expecting header on sock: r=-110
> Dec  2 10:35:12 alice kernel: drbd1: asender terminated
> Dec  2 10:35:12 alice kernel: drbd1: Terminating asender thread
> Dec  2 10:35:12 alice kernel: drbd1: Creating new current UUID
> Dec  2 10:35:12 alice kernel: drbd1: Writing meta data super block now.
> Dec  2 10:35:12 alice kernel: drbd1: tl_clear()
> Dec  2 10:35:12 alice kernel: drbd1: Connection closed
> Dec  2 10:35:12 alice kernel: drbd1: conn( BrokenPipe -> Unconnected ) 
> Dec  2 10:35:12 alice kernel: drbd1: receiver terminated
> Dec  2 10:35:12 alice kernel: drbd1: receiver (re)started
> 
> This is:
> 
> # uname -a
> Linux alkaid 2.6.18-92.1.13.el5 #1 SMP Thu Sep 4 03:51:21 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> # rpm -qa |grep drbd
> drbd-km-2.6.18_92.1.13.el5-8.2.6-3
> drbd-8.2.6-3
> 
> 
> Here is my drbd.conf:
> 
> common {
> 	protocol C;
> 	startup {
> 		wfc-timeout 10;
> 		degr-wfc-timeout 10;
> 	}
> 
> 	disk {
> 		on-io-error detach;
> 	}
> 
> 	net {
> 		cram-hmac-alg "sha1";
> 		shared-secret "xxxxxxx";
> 	}
> 
> 	syncer {
> 		rate 10M;
> 		verify-alg md5;
> 	}
> }
> 
> resource rb {
> 
> 	on alice {
> 		device /dev/drbd1;
> 		disk /dev/System/data_share;
> 		address 192.168.5.21:7789;
> 		meta-disk       internal;
> 	}
> 	on bob {
> 		device /dev/drbd1;
> 		disk /dev/System/data_share;
> 		address 192.168.5.22:7789;
> 		meta-disk       internal;
> 	}
> }

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
