Hello,
I always get a split-brain situation on one DRBD device after both
nodes have been rebooted. I'm wondering why this doesn't happen on the
second DRBD device.
On the peer node there are
[drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
messages in the logfile, but I checked network connectivity on the sync
interface (crossover, 100 Mbit full duplex, identical NICs) from both
sides, and I get around 11.5 MB/s every time I test with iperf.
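As a quick sanity check of that iperf figure: a 100 Mbit/s link tops out
at 12.5 MB/s before protocol overhead, so 11.5 MB/s is close to line
rate. The sketch below only does that arithmetic. It assumes the iperf
number is in megabytes per second, and the function names are made up
for illustration.

```python
# Rough check: how close is the measured iperf throughput to the
# nominal limit of a 100 Mbit/s link?
LINK_MBIT_PER_S = 100      # nominal link speed from the post
MEASURED_MB_PER_S = 11.5   # iperf result from the post (assumed to be MB/s)


def line_rate_mb_per_s(mbit_per_s):
    """Convert a nominal link speed in Mbit/s to MB/s (no protocol overhead)."""
    return mbit_per_s / 8


def utilisation(measured_mb_per_s, mbit_per_s):
    """Fraction of the nominal line rate that the measurement reaches."""
    return measured_mb_per_s / line_rate_mb_per_s(mbit_per_s)


if __name__ == "__main__":
    print(f"line rate:   {line_rate_mb_per_s(LINK_MBIT_PER_S):.1f} MB/s")
    print(f"utilisation: {utilisation(MEASURED_MB_PER_S, LINK_MBIT_PER_S):.0%}")
```

So raw TCP throughput on the replication link looks fine, which suggests
the timeouts are not caused by a generally slow or broken link.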
I also tuned the TCP stack via sysctl with the following parameters:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
I don't know if these values are right for my setup, but the same
behaviour occurs with the Ubuntu 8.04 server defaults.
How can I track down what the problem is with this device? And why is
the other device not affected by these network timeouts?
thx,
Chris
the log shows:
Aug 15 17:28:51 parastore01 kernel: [ 49.368122] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:51 parastore01 kernel: [ 49.368132] drbd0: Starting worker thread (from cqueue/0 [3899])
Aug 15 17:28:51 parastore01 kernel: [ 49.425995] drbd0: Found 31 transactions (565 active extents) in activity log.
Aug 15 17:28:51 parastore01 kernel: [ 49.426005] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:51 parastore01 kernel: [ 49.426012] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:51 parastore01 kernel: [ 49.428212] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:51 parastore01 kernel: [ 49.428223] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:51 parastore01 kernel: [ 49.506257] drbd0: reading of bitmap took 8 jiffies
Aug 15 17:28:51 parastore01 kernel: [ 49.508871] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:51 parastore01 kernel: [ 49.508878] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:51 parastore01 kernel: [ 49.509167] drbd0: Marked additional 2192 MB as out-of-sync based on AL.
Aug 15 17:28:52 parastore01 kernel: [ 49.717365] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [ 49.717377] drbd0: Writing meta data super block now.
Aug 15 17:28:52 parastore01 kernel: [ 49.876601] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:52 parastore01 kernel: [ 49.876762] drbd0: Starting receiver thread (from drbd0_worker [5090])
Aug 15 17:28:52 parastore01 kernel: [ 49.877852] drbd0: receiver (re)started
Aug 15 17:28:52 parastore01 kernel: [ 49.877864] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:52 parastore01 kernel: [ 49.972310] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:52 parastore01 kernel: [ 50.004672] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:52 parastore01 kernel: [ 50.004684] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:52 parastore01 kernel: [ 50.004690] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:28:52 parastore01 kernel: [ 50.008915] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [ 50.008932] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [ 98.531749] drbd0: meta connection shut down by peer.
Aug 15 17:29:40 parastore01 kernel: [ 98.531848] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:40 parastore01 kernel: [ 98.531865] drbd0: asender terminated
Aug 15 17:29:40 parastore01 kernel: [ 98.531868] drbd0: Terminating asender thread
Aug 15 17:29:40 parastore01 kernel: [ 98.532383] drbd0: role( Secondary -> Primary )
Aug 15 17:29:40 parastore01 kernel: [ 98.532395] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [ 98.533186] drbd0: sock_sendmsg returned -104
Aug 15 17:29:40 parastore01 kernel: [ 98.533251] drbd0: short sent ReportState size=12 sent=0
Aug 15 17:29:40 parastore01 kernel: [ 98.534146] drbd0: tl_clear()
Aug 15 17:29:40 parastore01 kernel: [ 98.534152] drbd0: Connection closed
Aug 15 17:29:40 parastore01 kernel: [ 98.534157] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:40 parastore01 kernel: [ 98.534161] drbd0: receiver terminated
Aug 15 17:29:40 parastore01 kernel: [ 98.534164] drbd0: receiver (re)started
Aug 15 17:29:40 parastore01 kernel: [ 98.534167] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:41 parastore01 kernel: [ 98.830269] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:41 parastore01 kernel: [ 98.862770] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:41 parastore01 kernel: [ 98.862790] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:41 parastore01 kernel: [ 98.862796] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:29:41 parastore01 kernel: [ 98.863560] drbd0: Split-Brain detected, dropping connection!
Aug 15 17:29:41 parastore01 kernel: [ 98.863632] drbd0: self D86BD7893327D85B:5F076D68071F86E5:A7706C3E4205FA52:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [ 98.863636] drbd0: peer 60C952BD7240404F:5F076D68071F86E4:A7706C3E4205FA53:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [ 98.863642] drbd0: conn( WFReportParams -> Disconnecting )
Aug 15 17:29:41 parastore01 kernel: [ 98.863648] drbd0: helper command: /sbin/drbdadm split-brain
Aug 15 17:29:41 parastore01 kernel: [ 98.869163] drbd0: error receiving ReportState, l: 4!
Aug 15 17:29:41 parastore01 kernel: [ 98.869395] drbd0: asender terminated
Aug 15 17:29:41 parastore01 kernel: [ 98.869401] drbd0: Terminating asender thread
Aug 15 17:29:41 parastore01 kernel: [ 98.870023] drbd0: tl_clear()
Aug 15 17:29:41 parastore01 kernel: [ 98.870030] drbd0: Connection closed
Aug 15 17:29:41 parastore01 kernel: [ 98.870043] drbd0: conn( Disconnecting -> StandAlone )
Aug 15 17:29:41 parastore01 kernel: [ 98.870049] drbd0: receiver terminated
Aug 15 17:29:41 parastore01 kernel: [ 98.870052] drbd0: Terminating receiver thread
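The "self" and "peer" lines above can be compared field by field. The
sketch below is only an aid for reading them. It assumes the usual
DRBD 8.x field order (current, bitmap, two history entries) and that the
lowest bit of each UUID is a flag that is ignored when UUIDs are
compared. Both are my understanding, not something the log states.

```python
# UUID sets copied from the "self" and "peer" log lines above.
SELF = "D86BD7893327D85B:5F076D68071F86E5:A7706C3E4205FA52:3E7B3E38C51EC4FF"
PEER = "60C952BD7240404F:5F076D68071F86E4:A7706C3E4205FA53:3E7B3E38C51EC4FF"

# Assumed field order for DRBD 8.x (not taken from the log itself).
FIELDS = ("current", "bitmap", "history1", "history2")


def parse(uuid_line):
    """Split a colon-separated UUID set into a list of integers."""
    return [int(part, 16) for part in uuid_line.split(":")]


def compare(self_line, peer_line):
    """Classify each field as equal, equal apart from bit 0, or different."""
    result = {}
    for name, a, b in zip(FIELDS, parse(self_line), parse(peer_line)):
        if a == b:
            result[name] = "equal"
        elif a ^ b == 1:
            result[name] = "equal ignoring bit 0"
        else:
            result[name] = "different"
    return result


if __name__ == "__main__":
    for field, verdict in compare(SELF, PEER).items():
        print(f"{field:9s} {verdict}")
```

Under those assumptions the older fields match and only the current
UUIDs differ, which would fit both nodes having become Primary on their
own while the connection was down.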
the log of the peer node:
Aug 15 17:28:43 parastore02 kernel: [ 66.035432] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:43 parastore02 kernel: [ 66.035442] drbd0: Starting worker thread (from cqueue/0 [3890])
Aug 15 17:28:43 parastore02 kernel: [ 66.074118] drbd0: Found 6 transactions (6 active extents) in activity log.
Aug 15 17:28:43 parastore02 kernel: [ 66.074127] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:43 parastore02 kernel: [ 66.074134] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:43 parastore02 kernel: [ 66.076351] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:43 parastore02 kernel: [ 66.076362] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:43 parastore02 kernel: [ 66.131997] drbd0: reading of bitmap took 6 jiffies
Aug 15 17:28:43 parastore02 kernel: [ 66.134610] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:43 parastore02 kernel: [ 66.134615] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:43 parastore02 kernel: [ 66.134644] drbd0: Marked additional 24 MB as out-of-sync based on AL.
Aug 15 17:28:43 parastore02 kernel: [ 66.149767] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:43 parastore02 kernel: [ 66.149778] drbd0: Writing meta data super block now.
Aug 15 17:28:43 parastore02 kernel: [ 66.314471] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:43 parastore02 kernel: [ 66.314636] drbd0: Starting receiver thread (from drbd0_worker [5118])
Aug 15 17:28:43 parastore02 kernel: [ 66.315752] drbd0: receiver (re)started
Aug 15 17:28:43 parastore02 kernel: [ 66.315764] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:44 parastore02 kernel: [ 67.017585] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:44 parastore02 kernel: [ 67.018675] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:44 parastore02 kernel: [ 67.018687] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:44 parastore02 kernel: [ 67.018706] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:28:44 parastore02 kernel: [ 67.064460] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:44 parastore02 kernel: [ 67.064476] drbd0: Writing meta data super block now.
Aug 15 17:29:02 parastore02 kernel: [ 85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
Aug 15 17:29:08 parastore02 kernel: [ 91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
Aug 15 17:29:14 parastore02 kernel: [ 97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
Aug 15 17:29:20 parastore02 kernel: [ 103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
Aug 15 17:29:26 parastore02 kernel: [ 109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
Aug 15 17:29:32 parastore02 kernel: [ 115.575814] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:32 parastore02 kernel: [ 115.575829] drbd0: short sent ReportBitMap size=4096 sent=3216
Aug 15 17:29:32 parastore02 kernel: [ 115.575925] drbd0: error receiving ReportBitMap, l: 0!
Aug 15 17:29:32 parastore02 kernel: [ 115.576429] drbd0: role( Secondary -> Primary )
Aug 15 17:29:32 parastore02 kernel: [ 115.576441] drbd0: Creating new current UUID
Aug 15 17:29:32 parastore02 kernel: [ 115.576451] drbd0: Writing meta data super block now.
Aug 15 17:29:32 parastore02 kernel: [ 115.576548] drbd0: asender terminated
Aug 15 17:29:32 parastore02 kernel: [ 115.576554] drbd0: Terminating asender thread
Aug 15 17:29:32 parastore02 kernel: [ 115.577220] drbd0: tl_clear()
Aug 15 17:29:32 parastore02 kernel: [ 115.577226] drbd0: Connection closed
Aug 15 17:29:32 parastore02 kernel: [ 115.577233] drbd0: conn( Timeout -> Unconnected )
Aug 15 17:29:32 parastore02 kernel: [ 115.577237] drbd0: receiver terminated
Aug 15 17:29:32 parastore02 kernel: [ 115.577240] drbd0: receiver (re)started
Aug 15 17:29:32 parastore02 kernel: [ 115.577243] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:33 parastore02 kernel: [ 115.875697] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:33 parastore02 kernel: [ 115.876347] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:33 parastore02 kernel: [ 115.876359] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:33 parastore02 kernel: [ 115.876364] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:29:33 parastore02 kernel: [ 115.915155] drbd0: meta connection shut down by peer.
Aug 15 17:29:33 parastore02 kernel: [ 115.915226] drbd0: conn( WFReportParams -> NetworkFailure )
Aug 15 17:29:33 parastore02 kernel: [ 115.915236] drbd0: asender terminated
Aug 15 17:29:33 parastore02 kernel: [ 115.915239] drbd0: Terminating asender thread
Aug 15 17:29:33 parastore02 kernel: [ 115.916116] drbd0: tl_clear()
Aug 15 17:29:33 parastore02 kernel: [ 115.916122] drbd0: Connection closed
Aug 15 17:29:33 parastore02 kernel: [ 115.916130] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:33 parastore02 kernel: [ 115.916134] drbd0: receiver terminated
Aug 15 17:29:33 parastore02 kernel: [ 115.916163] drbd0: receiver (re)started
Aug 15 17:29:33 parastore02 kernel: [ 115.916168] drbd0: conn( Unconnected -> WFConnection )
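The spacing of the "time expired" messages above can be checked against
the configured timeout. This is a minimal sketch that parses the kernel
timestamps of those five lines. It takes "timeout 60" (unit 1/10 s) from
the config dump, and the helper names are hypothetical.

```python
import re

# "sock_sendmsg time expired" lines copied from the peer log above
# (syslog prefix stripped, kernel timestamp kept).
LOG = """\
[ 85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
[ 91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
[ 97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
[ 103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
[ 109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
"""

TIMEOUT_TENTHS = 60  # "timeout 60" from the config dump, unit 1/10 s

PATTERN = re.compile(r"\[\s*(\d+\.\d+)\].*time expired, ko = (\d+)")


def ko_events(log):
    """Return (kernel_timestamp, ko_value) pairs for each matching line."""
    return [(float(m.group(1)), int(m.group(2))) for m in PATTERN.finditer(log)]


def intervals(events):
    """Seconds between consecutive events, rounded to milliseconds."""
    times = [t for t, _ in events]
    return [round(b - a, 3) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    events = ko_events(LOG)
    print("ko values:", [ko for _, ko in events])
    print("intervals:", intervals(events), "s")
    print("configured timeout:", TIMEOUT_TENTHS / 10, "s")
```

The messages are about 6 s apart, matching the configured timeout, and
the switch to Timeout at 115.575814 follows roughly 6 s after ko = 1.
That is consistent with ko-count 6 running out while the bitmap was
being sent.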
config of drbd0:
disk {
size 0s _is_default; # bytes
on-io-error detach;
fencing dont-care _is_default;
}
net {
timeout 60 _is_default; # 1/10 seconds
max-epoch-size 2048 _is_default;
max-buffers 2048 _is_default;
unplug-watermark 128 _is_default;
connect-int 10 _is_default; # seconds
ping-int 10 _is_default; # seconds
sndbuf-size 131070 _is_default; # bytes
ko-count 6;
allow-two-primaries;
cram-hmac-alg "md5";
shared-secret "Para2008Store";
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect _is_default;
rr-conflict disconnect _is_default;
ping-timeout 5 _is_default; # 1/10 seconds
}
syncer {
rate 5120k; # bytes/second
after -1 _is_default;
al-extents 1801;
}
protocol C;
_this_host {
device "/dev/drbd0";
disk "/dev/sda4";
meta-disk internal;
address 192.168.99.2:7788;
}
_remote_host {
address 192.168.99.1:7788;
}
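For scale, the syncer rate above can be set against the amount each node
marked out-of-sync at attach time (2192 MB on parastore01, 24 MB on
parastore02). This is a rough estimate only. It assumes "5120k" means
5120 KiB/s and that the "MB" in the log are MiB, and the function name
is made up.

```python
SYNCER_RATE_KIB_PER_S = 5120  # "rate 5120k" from the config dump (assumed KiB/s)


def resync_seconds(out_of_sync_mib, rate_kib_per_s=SYNCER_RATE_KIB_PER_S):
    """Estimate how long a resync of the given amount takes at a fixed rate."""
    return out_of_sync_mib * 1024 / rate_kib_per_s


if __name__ == "__main__":
    for node, mib in (("parastore01", 2192), ("parastore02", 24)):
        print(f"{node}: {mib} MiB -> about {resync_seconds(mib):.0f} s")
```

At that rate a resync of the larger amount would take several minutes,
but the connection in the logs is dropped before the bitmap exchange
even finishes.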
--
"The greatest proof that intelligent life other than humans exists in
the universe is that none of it has tried to contact us!"