Hello,

I always get a split-brain situation on one DRBD device after both nodes are rebooted. I'm wondering why this doesn't happen on the second DRBD device.

On the peer node there are

[drbd0_receiver/5137] sock_sendmsg time expired, ko = 5

messages in the logfile, but I checked network connectivity on the sync interface (crossover 100 Mbit full duplex, identical NICs) from both sides, and I get around 11.5 MB/s every time I try with iperf.

I also tuned the TCP stack with sysctl using the following params:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

I don't know if these values are fine for my setup, but the same behaviour happens with the Ubuntu 8.04 server defaults ...

How can I track down what the problem is with this device? And why is the other device not affected by these network timeouts?

thx,
Chris

the log shows:

Aug 15 17:28:51 parastore01 kernel: [ 49.368122] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:51 parastore01 kernel: [ 49.368132] drbd0: Starting worker thread (from cqueue/0 [3899])
Aug 15 17:28:51 parastore01 kernel: [ 49.425995] drbd0: Found 31 transactions (565 active extents) in activity log.
Aug 15 17:28:51 parastore01 kernel: [ 49.426005] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:51 parastore01 kernel: [ 49.426012] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:51 parastore01 kernel: [ 49.428212] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:51 parastore01 kernel: [ 49.428223] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:51 parastore01 kernel: [ 49.506257] drbd0: reading of bitmap took 8 jiffies
Aug 15 17:28:51 parastore01 kernel: [ 49.508871] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:51 parastore01 kernel: [ 49.508878] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:51 parastore01 kernel: [ 49.509167] drbd0: Marked additional 2192 MB as out-of-sync based on AL.
Aug 15 17:28:52 parastore01 kernel: [ 49.717365] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [ 49.717377] drbd0: Writing meta data super block now.
Aug 15 17:28:52 parastore01 kernel: [ 49.876601] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:52 parastore01 kernel: [ 49.876762] drbd0: Starting receiver thread (from drbd0_worker [5090])
Aug 15 17:28:52 parastore01 kernel: [ 49.877852] drbd0: receiver (re)started
Aug 15 17:28:52 parastore01 kernel: [ 49.877864] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:52 parastore01 kernel: [ 49.972310] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:52 parastore01 kernel: [ 50.004672] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:52 parastore01 kernel: [ 50.004684] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:52 parastore01 kernel: [ 50.004690] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:28:52 parastore01 kernel: [ 50.008915] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [ 50.008932] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [ 98.531749] drbd0: meta connection shut down by peer.
Aug 15 17:29:40 parastore01 kernel: [ 98.531848] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:40 parastore01 kernel: [ 98.531865] drbd0: asender terminated
Aug 15 17:29:40 parastore01 kernel: [ 98.531868] drbd0: Terminating asender thread
Aug 15 17:29:40 parastore01 kernel: [ 98.532383] drbd0: role( Secondary -> Primary )
Aug 15 17:29:40 parastore01 kernel: [ 98.532395] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [ 98.533186] drbd0: sock_sendmsg returned -104
Aug 15 17:29:40 parastore01 kernel: [ 98.533251] drbd0: short sent ReportState size=12 sent=0
Aug 15 17:29:40 parastore01 kernel: [ 98.534146] drbd0: tl_clear()
Aug 15 17:29:40 parastore01 kernel: [ 98.534152] drbd0: Connection closed
Aug 15 17:29:40 parastore01 kernel: [ 98.534157] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:40 parastore01 kernel: [ 98.534161] drbd0: receiver terminated
Aug 15 17:29:40 parastore01 kernel: [ 98.534164] drbd0: receiver (re)started
Aug 15 17:29:40 parastore01 kernel: [ 98.534167] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:41 parastore01 kernel: [ 98.830269] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:41 parastore01 kernel: [ 98.862770] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:41 parastore01 kernel: [ 98.862790] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:41 parastore01 kernel: [ 98.862796] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:29:41 parastore01 kernel: [ 98.863560] drbd0: Split-Brain detected, dropping connection!
Aug 15 17:29:41 parastore01 kernel: [ 98.863632] drbd0: self D86BD7893327D85B:5F076D68071F86E5:A7706C3E4205FA52:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [ 98.863636] drbd0: peer 60C952BD7240404F:5F076D68071F86E4:A7706C3E4205FA53:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [ 98.863642] drbd0: conn( WFReportParams -> Disconnecting )
Aug 15 17:29:41 parastore01 kernel: [ 98.863648] drbd0: helper command: /sbin/drbdadm split-brain
Aug 15 17:29:41 parastore01 kernel: [ 98.869163] drbd0: error receiving ReportState, l: 4!
Aug 15 17:29:41 parastore01 kernel: [ 98.869395] drbd0: asender terminated
Aug 15 17:29:41 parastore01 kernel: [ 98.869401] drbd0: Terminating asender thread
Aug 15 17:29:41 parastore01 kernel: [ 98.870023] drbd0: tl_clear()
Aug 15 17:29:41 parastore01 kernel: [ 98.870030] drbd0: Connection closed
Aug 15 17:29:41 parastore01 kernel: [ 98.870043] drbd0: conn( Disconnecting -> StandAlone )
Aug 15 17:29:41 parastore01 kernel: [ 98.870049] drbd0: receiver terminated
Aug 15 17:29:41 parastore01 kernel: [ 98.870052] drbd0: Terminating receiver thread

the log of the peer node:

Aug 15 17:28:43 parastore02 kernel: [ 66.035432] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:43 parastore02 kernel: [ 66.035442] drbd0: Starting worker thread (from cqueue/0 [3890])
Aug 15 17:28:43 parastore02 kernel: [ 66.074118] drbd0: Found 6 transactions (6 active extents) in activity log.
Aug 15 17:28:43 parastore02 kernel: [ 66.074127] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:43 parastore02 kernel: [ 66.074134] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:43 parastore02 kernel: [ 66.076351] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:43 parastore02 kernel: [ 66.076362] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:43 parastore02 kernel: [ 66.131997] drbd0: reading of bitmap took 6 jiffies
Aug 15 17:28:43 parastore02 kernel: [ 66.134610] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:43 parastore02 kernel: [ 66.134615] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:43 parastore02 kernel: [ 66.134644] drbd0: Marked additional 24 MB as out-of-sync based on AL.
Aug 15 17:28:43 parastore02 kernel: [ 66.149767] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:43 parastore02 kernel: [ 66.149778] drbd0: Writing meta data super block now.
Aug 15 17:28:43 parastore02 kernel: [ 66.314471] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:43 parastore02 kernel: [ 66.314636] drbd0: Starting receiver thread (from drbd0_worker [5118])
Aug 15 17:28:43 parastore02 kernel: [ 66.315752] drbd0: receiver (re)started
Aug 15 17:28:43 parastore02 kernel: [ 66.315764] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:44 parastore02 kernel: [ 67.017585] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:44 parastore02 kernel: [ 67.018675] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:44 parastore02 kernel: [ 67.018687] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:44 parastore02 kernel: [ 67.018706] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:28:44 parastore02 kernel: [ 67.064460] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:44 parastore02 kernel: [ 67.064476] drbd0: Writing meta data super block now.
Aug 15 17:29:02 parastore02 kernel: [ 85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
Aug 15 17:29:08 parastore02 kernel: [ 91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
Aug 15 17:29:14 parastore02 kernel: [ 97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
Aug 15 17:29:20 parastore02 kernel: [ 103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
Aug 15 17:29:26 parastore02 kernel: [ 109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
Aug 15 17:29:32 parastore02 kernel: [ 115.575814] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:32 parastore02 kernel: [ 115.575829] drbd0: short sent ReportBitMap size=4096 sent=3216
Aug 15 17:29:32 parastore02 kernel: [ 115.575925] drbd0: error receiving ReportBitMap, l: 0!
Aug 15 17:29:32 parastore02 kernel: [ 115.576429] drbd0: role( Secondary -> Primary )
Aug 15 17:29:32 parastore02 kernel: [ 115.576441] drbd0: Creating new current UUID
Aug 15 17:29:32 parastore02 kernel: [ 115.576451] drbd0: Writing meta data super block now.
Aug 15 17:29:32 parastore02 kernel: [ 115.576548] drbd0: asender terminated
Aug 15 17:29:32 parastore02 kernel: [ 115.576554] drbd0: Terminating asender thread
Aug 15 17:29:32 parastore02 kernel: [ 115.577220] drbd0: tl_clear()
Aug 15 17:29:32 parastore02 kernel: [ 115.577226] drbd0: Connection closed
Aug 15 17:29:32 parastore02 kernel: [ 115.577233] drbd0: conn( Timeout -> Unconnected )
Aug 15 17:29:32 parastore02 kernel: [ 115.577237] drbd0: receiver terminated
Aug 15 17:29:32 parastore02 kernel: [ 115.577240] drbd0: receiver (re)started
Aug 15 17:29:32 parastore02 kernel: [ 115.577243] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:33 parastore02 kernel: [ 115.875697] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:33 parastore02 kernel: [ 115.876347] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:33 parastore02 kernel: [ 115.876359] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:33 parastore02 kernel: [ 115.876364] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:29:33 parastore02 kernel: [ 115.915155] drbd0: meta connection shut down by peer.
Aug 15 17:29:33 parastore02 kernel: [ 115.915226] drbd0: conn( WFReportParams -> NetworkFailure )
Aug 15 17:29:33 parastore02 kernel: [ 115.915236] drbd0: asender terminated
Aug 15 17:29:33 parastore02 kernel: [ 115.915239] drbd0: Terminating asender thread
Aug 15 17:29:33 parastore02 kernel: [ 115.916116] drbd0: tl_clear()
Aug 15 17:29:33 parastore02 kernel: [ 115.916122] drbd0: Connection closed
Aug 15 17:29:33 parastore02 kernel: [ 115.916130] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:33 parastore02 kernel: [ 115.916134] drbd0: receiver terminated
Aug 15 17:29:33 parastore02 kernel: [ 115.916163] drbd0: receiver (re)started
Aug 15 17:29:33 parastore02 kernel: [ 115.916168] drbd0: conn( Unconnected -> WFConnection )

config of drbd0:

disk {
    size 0s _is_default; # bytes
    on-io-error detach;
    fencing dont-care _is_default;
}
net {
    timeout 60 _is_default; # 1/10 seconds
    max-epoch-size 2048 _is_default;
    max-buffers 2048 _is_default;
    unplug-watermark 128 _is_default;
    connect-int 10 _is_default; # seconds
    ping-int 10 _is_default; # seconds
    sndbuf-size 131070 _is_default; # bytes
    ko-count 6;
    allow-two-primaries;
    cram-hmac-alg "md5";
    shared-secret "Para2008Store";
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect _is_default;
    rr-conflict disconnect _is_default;
    ping-timeout 5 _is_default; # 1/10 seconds
}
syncer {
    rate 5120k; # bytes/second
    after -1 _is_default;
    al-extents 1801;
}
protocol C;
_this_host {
    device "/dev/drbd0";
    disk "/dev/sda4";
    meta-disk internal;
    address 192.168.99.2:7788;
}
_remote_host {
    address 192.168.99.1:7788;
}

-- 
"The greatest proof that intelligent life other than humans exists in the universe is that none of it has tried to contact us!"
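
One thing that can be checked directly from the parastore02 log is how the "sock_sendmsg time expired" messages are spaced compared with the configured "timeout 60" (unit 1/10 second, so 6.0 s). The following is only a minimal sketch in Python: the five log lines are copied from this post, while the names PEER_LOG, expiry_events and expiry_gaps are made up for illustration and are not part of DRBD or any of its tools. It reads the kernel timestamp and the ko value from each line and prints the gap between consecutive expiries.

```python
import re

# Lines copied from the parastore02 log in this post.
PEER_LOG = """\
Aug 15 17:29:02 parastore02 kernel: [ 85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
Aug 15 17:29:08 parastore02 kernel: [ 91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
Aug 15 17:29:14 parastore02 kernel: [ 97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
Aug 15 17:29:20 parastore02 kernel: [ 103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
Aug 15 17:29:26 parastore02 kernel: [ 109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
"""

# "timeout 60" from the net section of the config; the unit is 1/10 second.
TIMEOUT_TENTHS = 60

# Kernel uptime stamp in brackets, then the drbd device, then the ko counter.
EXPIRED = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+drbd\d+:.*sock_sendmsg time expired, ko = (\d+)"
)


def expiry_events(log_text):
    """Return (kernel_timestamp_seconds, ko_value) for every expiry line."""
    events = []
    for line in log_text.splitlines():
        match = EXPIRED.search(line)
        if match:
            events.append((float(match.group(1)), int(match.group(2))))
    return events


def expiry_gaps(events):
    """Return the seconds elapsed between consecutive expiry events."""
    return [round(later[0] - earlier[0], 3)
            for earlier, later in zip(events, events[1:])]


events = expiry_events(PEER_LOG)
print("configured timeout: %.1f s" % (TIMEOUT_TENTHS / 10.0))
print("ko values:", [ko for _, ko in events])
print("gaps between expiries (s):", expiry_gaps(events))
```

With the lines above, every gap comes out at just under 6 seconds, which matches the configured timeout, and the ko value counts down from 5 to 1 before the connection goes to Timeout. The sketch only shows that the receiver on parastore02 stays blocked in sending for the whole countdown; it does not say why the send is stuck.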