[DRBD-user] 15 min timeout to drop TCP conn on MTU change (8.2.5)

Tue Dec 2 17:49:18 CET 2008

(I apologize for the dupe, I just posted this on the devel list but I 
intended to post this here)

A weird problem hit me today: I changed the MTU on the standby node of an 
active/passive PostgreSQL cluster based on DRBD. This froze the drbd device 
for almost exactly 15 minutes, during which no postgres process could 
commit.

Any idea why the timeout was so long? 

Note that the two nodes are in separate locations, linked by a (currently 
mostly idle) 100Mbps bridge. 

This is the last message from postgres in /var/log/messages:

Dec  2 10:20:18 alice postgres[28729]: [38440-1] 2008-12-02 10:20:18 GMT radiusdb 192.168.0.51 28729 48fff5bc.7039 SELECTLOG:  duration: 0.183 ms

Nothing happens for 15 mins, until this shows up:

Dec  2 10:35:12 alice kernel: drbd1: sock_recvmsg returned -110
Dec  2 10:35:12 alice kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Dec  2 10:35:12 alice kernel: drbd1: short read expecting header on sock: r=-110
Dec  2 10:35:12 alice kernel: drbd1: asender terminated
Dec  2 10:35:12 alice kernel: drbd1: Terminating asender thread
Dec  2 10:35:12 alice kernel: drbd1: Creating new current UUID
Dec  2 10:35:12 alice kernel: drbd1: Writing meta data super block now.
Dec  2 10:35:12 alice kernel: drbd1: tl_clear()
Dec  2 10:35:12 alice kernel: drbd1: Connection closed
Dec  2 10:35:12 alice kernel: drbd1: conn( BrokenPipe -> Unconnected ) 
Dec  2 10:35:12 alice kernel: drbd1: receiver terminated
Dec  2 10:35:12 alice kernel: drbd1: receiver (re)started
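
As a rough sanity check on the timing (a sketch, not a diagnosis): -110 is 
ETIMEDOUT on Linux, and the gap between the two log excerpts above can be 
compared with how long the kernel's TCP stack would keep retransmitting 
before giving up. The figures used below (tcp_retries2 = 15, a 200 ms 
minimum RTO and a 120 s maximum RTO) are the stock Linux defaults and are 
an assumption here; I have not checked them on these nodes, and whether 
TCP retransmission is what actually held the connection open is also an 
assumption.

```python
from datetime import datetime

# Timestamps copied from the two log excerpts above (same host clock).
FMT = "%Y-%m-%d %H:%M:%S"
last_pg = datetime.strptime("2008-12-02 10:20:18", FMT)     # last postgres line
first_drbd = datetime.strptime("2008-12-02 10:35:12", FMT)  # first drbd1 line
gap = (first_drbd - last_pg).total_seconds()


def tcp_retransmit_budget(retries=15, rto_min=0.2, rto_max=120.0):
    """Hypothetical time before TCP gives up on an unacknowledged segment.

    ASSUMPTION: stock Linux defaults (net.ipv4.tcp_retries2 = 15,
    minimum RTO 200 ms, maximum RTO 120 s). The real value depends on
    the measured RTT, so this is only an order-of-magnitude estimate.
    """
    total, rto = 0.0, rto_min
    # One initial wait plus one wait per retransmission, doubling each
    # time and capped at rto_max.
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


budget = tcp_retransmit_budget()
print(f"observed gap:         {gap:.0f} s ({gap / 60:.1f} min)")
print(f"hypothetical give-up: {budget:.1f} s ({budget / 60:.1f} min)")
```

The two numbers land within about half a minute of each other, which is 
suggestive but proves nothing by itself.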

This is:

# uname -a
Linux alkaid 2.6.18-92.1.13.el5 #1 SMP Thu Sep 4 03:51:21 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
# rpm -qa |grep drbd
drbd-km-2.6.18_92.1.13.el5-8.2.6-3
drbd-8.2.6-3

Here is my drbd.conf:

common {
	protocol C;
	startup {
		wfc-timeout 10;
		degr-wfc-timeout 10;
	}

	disk {
		on-io-error detach;
	}

	net {
		cram-hmac-alg "sha1";
		shared-secret "xxxxxxx";
	}

	syncer {
		rate 10M;
		verify-alg md5;
	}
}

resource rb {

	on alice {
		device /dev/drbd1;
		disk /dev/System/data_share;
		address 192.168.5.21:7789;
		meta-disk       internal;
	}
	on bob {
		device /dev/drbd1;
		disk /dev/System/data_share;
		address 192.168.5.22:7789;
		meta-disk       internal;
	}
}