[DRBD-user] Always get Split brain after reboot of both nodes

Chris Joelly chris-m-lists at joelly.net
Fri Aug 15 18:09:28 CEST 2008


Hello,

I always get a split-brain situation on one DRBD device after both
nodes are rebooted. Why doesn't this happen on the second DRBD
device?

On the peer node there are

[drbd0_receiver/5137] sock_sendmsg time expired, ko = 5

messages in the logfile, but I checked network connectivity on the sync
interface (100 Mbit full-duplex crossover link, identical NICs) from both
sides, and I get around 11.5 MB/s every time I test with iperf.
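
As a rough sanity check of that iperf figure, the sketch below compares it
with the raw line rate of a 100 Mbit link. It assumes the measured value is
in megabytes per second and ignores protocol overhead; the function names
are made up for illustration.

```python
# Sanity check: how close is the measured iperf rate to the raw line rate?
LINK_MBIT_PER_S = 100.0       # from the post: 100 Mbit full-duplex crossover
MEASURED_MBYTE_PER_S = 11.5   # from the post, read as MB/s (assumption)


def line_rate_mbyte_per_s(link_mbit_per_s: float) -> float:
    """Raw line rate in MB/s (8 bits per byte, protocol overhead ignored)."""
    return link_mbit_per_s / 8.0


def utilisation(measured_mbyte_per_s: float, link_mbit_per_s: float) -> float:
    """Fraction of the raw line rate that the measurement reaches."""
    return measured_mbyte_per_s / line_rate_mbyte_per_s(link_mbit_per_s)


if __name__ == "__main__":
    rate = line_rate_mbyte_per_s(LINK_MBIT_PER_S)
    used = utilisation(MEASURED_MBYTE_PER_S, LINK_MBIT_PER_S)
    print(f"line rate: {rate:.1f} MB/s, measured: {used:.0%} of line rate")
```

A result close to the line rate suggests that bulk TCP throughput on the
link is fine, so the stalled sends are probably not a plain bandwidth
problem.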

I also tuned the TCP stack via sysctl with the following parameters:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

I don't know whether these values suit my setup, but the same behaviour
occurs with the Ubuntu 8.04 server defaults.

How can I track down what the problem is with this device? And why is
the other device not affected by these network timeouts?

thx,

Chris

The log on parastore01 shows:

Aug 15 17:28:51 parastore01 kernel: [   49.368122] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:51 parastore01 kernel: [   49.368132] drbd0: Starting worker thread (from cqueue/0 [3899])
Aug 15 17:28:51 parastore01 kernel: [   49.425995] drbd0: Found 31 transactions (565 active extents) in activity log.
Aug 15 17:28:51 parastore01 kernel: [   49.426005] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:51 parastore01 kernel: [   49.426012] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:51 parastore01 kernel: [   49.428212] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:51 parastore01 kernel: [   49.428223] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:51 parastore01 kernel: [   49.506257] drbd0: reading of bitmap took 8 jiffies
Aug 15 17:28:51 parastore01 kernel: [   49.508871] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:51 parastore01 kernel: [   49.508878] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:51 parastore01 kernel: [   49.509167] drbd0: Marked additional 2192 MB as out-of-sync based on AL.
Aug 15 17:28:52 parastore01 kernel: [   49.717365] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [   49.717377] drbd0: Writing meta data super block now.
Aug 15 17:28:52 parastore01 kernel: [   49.876601] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:52 parastore01 kernel: [   49.876762] drbd0: Starting receiver thread (from drbd0_worker [5090])
Aug 15 17:28:52 parastore01 kernel: [   49.877852] drbd0: receiver (re)started
Aug 15 17:28:52 parastore01 kernel: [   49.877864] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:52 parastore01 kernel: [   49.972310] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:52 parastore01 kernel: [   50.004672] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:52 parastore01 kernel: [   50.004684] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:52 parastore01 kernel: [   50.004690] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:28:52 parastore01 kernel: [   50.008915] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [   50.008932] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [   98.531749] drbd0: meta connection shut down by peer.
Aug 15 17:29:40 parastore01 kernel: [   98.531848] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:40 parastore01 kernel: [   98.531865] drbd0: asender terminated
Aug 15 17:29:40 parastore01 kernel: [   98.531868] drbd0: Terminating asender thread
Aug 15 17:29:40 parastore01 kernel: [   98.532383] drbd0: role( Secondary -> Primary )
Aug 15 17:29:40 parastore01 kernel: [   98.532395] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [   98.533186] drbd0: sock_sendmsg returned -104
Aug 15 17:29:40 parastore01 kernel: [   98.533251] drbd0: short sent ReportState size=12 sent=0
Aug 15 17:29:40 parastore01 kernel: [   98.534146] drbd0: tl_clear()
Aug 15 17:29:40 parastore01 kernel: [   98.534152] drbd0: Connection closed
Aug 15 17:29:40 parastore01 kernel: [   98.534157] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:40 parastore01 kernel: [   98.534161] drbd0: receiver terminated
Aug 15 17:29:40 parastore01 kernel: [   98.534164] drbd0: receiver (re)started
Aug 15 17:29:40 parastore01 kernel: [   98.534167] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:41 parastore01 kernel: [   98.830269] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:41 parastore01 kernel: [   98.862770] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:41 parastore01 kernel: [   98.862790] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:41 parastore01 kernel: [   98.862796] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:29:41 parastore01 kernel: [   98.863560] drbd0: Split-Brain detected, dropping connection!
Aug 15 17:29:41 parastore01 kernel: [   98.863632] drbd0: self D86BD7893327D85B:5F076D68071F86E5:A7706C3E4205FA52:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [   98.863636] drbd0: peer 60C952BD7240404F:5F076D68071F86E4:A7706C3E4205FA53:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [   98.863642] drbd0: conn( WFReportParams -> Disconnecting )
Aug 15 17:29:41 parastore01 kernel: [   98.863648] drbd0: helper command: /sbin/drbdadm split-brain
Aug 15 17:29:41 parastore01 kernel: [   98.869163] drbd0: error receiving ReportState, l: 4!
Aug 15 17:29:41 parastore01 kernel: [   98.869395] drbd0: asender terminated
Aug 15 17:29:41 parastore01 kernel: [   98.869401] drbd0: Terminating asender thread
Aug 15 17:29:41 parastore01 kernel: [   98.870023] drbd0: tl_clear()
Aug 15 17:29:41 parastore01 kernel: [   98.870030] drbd0: Connection closed
Aug 15 17:29:41 parastore01 kernel: [   98.870043] drbd0: conn( Disconnecting -> StandAlone )
Aug 15 17:29:41 parastore01 kernel: [   98.870049] drbd0: receiver terminated
Aug 15 17:29:41 parastore01 kernel: [   98.870052] drbd0: Terminating receiver thread
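
To make the two UUID lines in the log above easier to read, here is a
minimal sketch that parses and compares them. It rests on two assumptions
that should be checked against the DRBD documentation for this version:
that the four fields are printed in the order current:bitmap:older:oldest,
and that the lowest bit of each value is a flag rather than part of the
identifier. The helper names are hypothetical.

```python
import re

# The two UUID lines from the parastore01 log (syslog prefix removed).
LOG = """\
drbd0: self D86BD7893327D85B:5F076D68071F86E5:A7706C3E4205FA52:3E7B3E38C51EC4FF
drbd0: peer 60C952BD7240404F:5F076D68071F86E4:A7706C3E4205FA53:3E7B3E38C51EC4FF
"""

UUID_LINE = re.compile(r"drbd\d+: (self|peer) ((?:[0-9A-F]{16}:){3}[0-9A-F]{16})")


def parse_uuids(log: str) -> dict:
    """Return {'self': [u0, u1, u2, u3], 'peer': [...]} as integers."""
    return {
        m.group(1): [int(field, 16) for field in m.group(2).split(":")]
        for m in UUID_LINE.finditer(log)
    }


def same_generation(a: int, b: int) -> bool:
    """Compare two UUIDs while ignoring the lowest bit (assumed to be a flag)."""
    return (a & ~1) == (b & ~1)


uuids = parse_uuids(LOG)
self_current, self_bitmap = uuids["self"][0], uuids["self"][1]
peer_current, peer_bitmap = uuids["peer"][0], uuids["peer"][1]

# Different first field but matching second field: under the assumed field
# order, both nodes started a new data generation from the same ancestor.
diverged = (not same_generation(self_current, peer_current)
            and same_generation(self_bitmap, peer_bitmap))

if __name__ == "__main__":
    print("first fields match: ", same_generation(self_current, peer_current))
    print("second fields match:", same_generation(self_bitmap, peer_bitmap))
    print("diverged from common ancestor:", diverged)
```

If those assumptions hold, the pattern fits both nodes having been promoted
to Primary while disconnected, which agrees with the role( Secondary ->
Primary ) lines that appear in both logs just before the reconnect.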

The log of the peer node (parastore02):

Aug 15 17:28:43 parastore02 kernel: [   66.035432] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:43 parastore02 kernel: [   66.035442] drbd0: Starting worker thread (from cqueue/0 [3890])
Aug 15 17:28:43 parastore02 kernel: [   66.074118] drbd0: Found 6 transactions (6 active extents) in activity log.
Aug 15 17:28:43 parastore02 kernel: [   66.074127] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:43 parastore02 kernel: [   66.074134] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:43 parastore02 kernel: [   66.076351] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:43 parastore02 kernel: [   66.076362] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:43 parastore02 kernel: [   66.131997] drbd0: reading of bitmap took 6 jiffies
Aug 15 17:28:43 parastore02 kernel: [   66.134610] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:43 parastore02 kernel: [   66.134615] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:43 parastore02 kernel: [   66.134644] drbd0: Marked additional 24 MB as out-of-sync based on AL.
Aug 15 17:28:43 parastore02 kernel: [   66.149767] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:43 parastore02 kernel: [   66.149778] drbd0: Writing meta data super block now.
Aug 15 17:28:43 parastore02 kernel: [   66.314471] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:43 parastore02 kernel: [   66.314636] drbd0: Starting receiver thread (from drbd0_worker [5118])
Aug 15 17:28:43 parastore02 kernel: [   66.315752] drbd0: receiver (re)started
Aug 15 17:28:43 parastore02 kernel: [   66.315764] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:44 parastore02 kernel: [   67.017585] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:44 parastore02 kernel: [   67.018675] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:44 parastore02 kernel: [   67.018687] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:44 parastore02 kernel: [   67.018706] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:28:44 parastore02 kernel: [   67.064460] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:44 parastore02 kernel: [   67.064476] drbd0: Writing meta data super block now.
Aug 15 17:29:02 parastore02 kernel: [   85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
Aug 15 17:29:08 parastore02 kernel: [   91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
Aug 15 17:29:14 parastore02 kernel: [   97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
Aug 15 17:29:20 parastore02 kernel: [  103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
Aug 15 17:29:26 parastore02 kernel: [  109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
Aug 15 17:29:32 parastore02 kernel: [  115.575814] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:32 parastore02 kernel: [  115.575829] drbd0: short sent ReportBitMap size=4096 sent=3216
Aug 15 17:29:32 parastore02 kernel: [  115.575925] drbd0: error receiving ReportBitMap, l: 0!
Aug 15 17:29:32 parastore02 kernel: [  115.576429] drbd0: role( Secondary -> Primary )
Aug 15 17:29:32 parastore02 kernel: [  115.576441] drbd0: Creating new current UUID
Aug 15 17:29:32 parastore02 kernel: [  115.576451] drbd0: Writing meta data super block now.
Aug 15 17:29:32 parastore02 kernel: [  115.576548] drbd0: asender terminated
Aug 15 17:29:32 parastore02 kernel: [  115.576554] drbd0: Terminating asender thread
Aug 15 17:29:32 parastore02 kernel: [  115.577220] drbd0: tl_clear()
Aug 15 17:29:32 parastore02 kernel: [  115.577226] drbd0: Connection closed
Aug 15 17:29:32 parastore02 kernel: [  115.577233] drbd0: conn( Timeout -> Unconnected )
Aug 15 17:29:32 parastore02 kernel: [  115.577237] drbd0: receiver terminated
Aug 15 17:29:32 parastore02 kernel: [  115.577240] drbd0: receiver (re)started
Aug 15 17:29:32 parastore02 kernel: [  115.577243] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:33 parastore02 kernel: [  115.875697] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:33 parastore02 kernel: [  115.876347] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:33 parastore02 kernel: [  115.876359] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:33 parastore02 kernel: [  115.876364] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:29:33 parastore02 kernel: [  115.915155] drbd0: meta connection shut down by peer.
Aug 15 17:29:33 parastore02 kernel: [  115.915226] drbd0: conn( WFReportParams -> NetworkFailure )
Aug 15 17:29:33 parastore02 kernel: [  115.915236] drbd0: asender terminated
Aug 15 17:29:33 parastore02 kernel: [  115.915239] drbd0: Terminating asender thread
Aug 15 17:29:33 parastore02 kernel: [  115.916116] drbd0: tl_clear()
Aug 15 17:29:33 parastore02 kernel: [  115.916122] drbd0: Connection closed
Aug 15 17:29:33 parastore02 kernel: [  115.916130] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:33 parastore02 kernel: [  115.916134] drbd0: receiver terminated
Aug 15 17:29:33 parastore02 kernel: [  115.916163] drbd0: receiver (re)started
Aug 15 17:29:33 parastore02 kernel: [  115.916168] drbd0: conn( Unconnected -> WFConnection )
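
The ko countdown in the peer log above can be checked against the
configured timeout with a small script. This is only a sketch: it reads
each "time expired" message as one expired send timeout, which is an
interpretation of the log and not something the log states. The input
lines are copied from the log with the syslog prefix removed.

```python
import re

LOG = """\
[   85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
[   91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
[   97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
[  103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
[  109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
[  115.575814] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
"""

KO_LINE = re.compile(r"\[\s*(\d+\.\d+)\].*sock_sendmsg time expired, ko = (\d+)")
TIMEOUT_LINE = re.compile(r"\[\s*(\d+\.\d+)\].*-> Timeout \)")

# Kernel timestamps (seconds since boot) and ko values of the countdown.
ko_events = [(float(t), int(ko)) for t, ko in KO_LINE.findall(LOG)]
ko_values = [ko for _, ko in ko_events]

# Timestamp of the final state change to Timeout.
timeout_at = float(TIMEOUT_LINE.search(LOG).group(1))

# Gaps between consecutive events, including the last step to Timeout.
stamps = [t for t, _ in ko_events] + [timeout_at]
intervals = [later - earlier for earlier, later in zip(stamps, stamps[1:])]

if __name__ == "__main__":
    for gap in intervals:
        print(f"{gap:.3f} s between events")
```

Every gap comes out at just under six seconds, which matches the timeout
of 60 (in tenths of a second) shown in the config of drbd0. So the sender
was blocked for the whole time and not just briefly delayed.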

config of drbd0:

disk {
	size            	0s _is_default; # bytes
	on-io-error     	detach;
	fencing         	dont-care _is_default;
}
net {
	timeout         	60 _is_default; # 1/10 seconds
	max-epoch-size  	2048 _is_default;
	max-buffers     	2048 _is_default;
	unplug-watermark	128 _is_default;
	connect-int     	10 _is_default; # seconds
	ping-int        	10 _is_default; # seconds
	sndbuf-size     	131070 _is_default; # bytes
	ko-count        	6;
	allow-two-primaries;
	cram-hmac-alg   	"md5";
	shared-secret   	"Para2008Store";
	after-sb-0pri   	discard-zero-changes;
	after-sb-1pri   	discard-secondary;
	after-sb-2pri   	disconnect _is_default;
	rr-conflict     	disconnect _is_default;
	ping-timeout    	5 _is_default; # 1/10 seconds
}
syncer {
	rate            	5120k; # bytes/second
	after           	-1 _is_default;
	al-extents      	1801;
}
protocol C;
_this_host {
	device			"/dev/drbd0";
	disk			"/dev/sda4";
	meta-disk		internal;
	address			192.168.99.2:7788;
}
_remote_host {
	address			192.168.99.1:7788;
}
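
The timing values in the net section above use mixed units, so a small
helper that normalises them to seconds can make them easier to compare.
This is a sketch under assumptions: the parsing only handles lines shaped
like the ones in this dump, and the final product reflects my reading that
a peer is given up on after ko-count times timeout. The function name is
made up.

```python
import re

# Timing-related lines copied from the net section of the dump.
NET_SECTION = """\
timeout          60 _is_default; # 1/10 seconds
connect-int      10 _is_default; # seconds
ping-int         10 _is_default; # seconds
ko-count         6;
ping-timeout     5 _is_default; # 1/10 seconds
"""

SETTING = re.compile(
    r"^[ \t]*([\w-]+)[ \t]+(\d+)(?: _is_default)?;(?:[ \t]*#[ \t]*(.*))?$",
    re.MULTILINE,
)


def parse_net_settings(section: str) -> dict:
    """Map option name to its value, converting 1/10-second units to seconds."""
    settings = {}
    for name, value, unit in SETTING.findall(section):
        number = int(value)
        if unit.strip() == "1/10 seconds":
            settings[name] = number / 10.0
        else:
            settings[name] = number  # seconds, or a plain count for ko-count
    return settings


settings = parse_net_settings(NET_SECTION)

# Assumption: the connection is dropped after ko-count consecutive timeouts.
give_up_after = settings["ko-count"] * settings["timeout"]

if __name__ == "__main__":
    for name, value in settings.items():
        print(f"{name:14s} {value}")
    print(f"peer dropped after about {give_up_after:.0f} s of blocked sends")
```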


-- 
"The greatest proof that intelligent life other than humans exists in
 the universe is that none of it has tried to contact us!"



