Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,
I'm using drbd 8.3.7 on a 2.6.32 kernel .
This is running in a embedded environment in a card cage with 2 cards.
The cards are connected with an internal 10G MAC.
My drbd configuration is:
global {
usage-count no;
}
common {
protocol C;
syncer {
rate 10M;
}
net {
ko-count 6;
}
handlers {
# see also /etc/drbd.d/global_common.conf
split-brain "/opt/compass/bin/drbd-notify-split-brain.sh";
}
}
resource home {
meta-disk internal;
device /dev/drbd1;
disk /dev/ssd2;
on cpm0 {
address 1.1.1.129:7788;
}
on cpm1 {
address 1.1.1.130:7788;
}
}
resource syslog {
meta-disk internal;
device /dev/drbd2;
disk /dev/ssd3;
on cpm0 {
address 1.1.1.129:7789;
}
on cpm1 {
address 1.1.1.130:7789;
}
}
As you see my syncer rate is 10M, which is not too much for a 10G link.
BTW, tcpdump doesn't show much traffic other than DRBD.
When I get into a situation that requires a re-sync I keep getting sock_sendmsg timeouts, and the situation never heals.
Here's the console output on the primary node:
[ 116.784427] block drbd1: Starting worker thread (from cqueue [4659])
[ 116.803430] block drbd1: disk( Diskless -> Attaching )
[ 116.813328] block drbd1: Found 4 transactions (4 active extents) in activity log.
[ 116.825358] block drbd1: Method to ensure write ordering: barrier
[ 116.832006] block drbd1: max_segment_size ( = BIO size ) = 32768
[ 116.838643] block drbd1: drbd_bm_resize called with capacity == 88081624
[ 116.846592] block drbd1: resync bitmap: bits=11010203 words=172035
[ 116.853419] block drbd1: size = 42 GB (44040812 KB)
[ 116.877240] block drbd1: recounting of set bits took additional 3 jiffies
[ 116.885609] block drbd1: 7729 MB (1978523 bits) marked out-of-sync by on disk bit-map.
[ 116.894682] block drbd1: Marked additional 4096 KB as out-of-sync based on AL.
[ 116.903905] block drbd1: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
[ 117.019539] block drbd1: conn( StandAlone -> Unconnected )
[ 117.025479] block drbd1: Starting receiver thread (from drbd1_worker [4664])
[ 117.057752] block drbd1: receiver (re)started
[ 117.062463] block drbd1: conn( Unconnected -> WFConnection )
[ 117.147161] block drbd2: Starting worker thread (from cqueue [4659])
[ 117.154200] block drbd2: disk( Diskless -> Attaching )
[ 117.161682] block drbd2: Found 4 transactions (8 active extents) in activity log.
[ 117.169835] block drbd2: Method to ensure write ordering: barrier
[ 117.176383] block drbd2: max_segment_size ( = BIO size ) = 32768
[ 117.182833] block drbd2: drbd_bm_resize called with capacity == 20980168
[ 117.190127] block drbd2: resync bitmap: bits=2622521 words=40977
[ 117.196635] block drbd2: size = 10 GB (10490084 KB)
[ 117.206486] block drbd2: recounting of set bits took additional 1 jiffies
[ 117.214012] block drbd2: 1668 MB (427065 bits) marked out-of-sync by on disk bit-map.
[ 117.222494] block drbd2: Marked additional 12 MB as out-of-sync based on AL.
[ 117.230980] block drbd2: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
[ 117.313623] block drbd2: conn( StandAlone -> Unconnected )
[ 117.320820] block drbd2: Starting receiver thread (from drbd2_worker [4710])
[ 117.328676] block drbd2: receiver (re)started
[ 117.333448] block drbd2: conn( Unconnected -> WFConnection )
[ 117.813264] block drbd2: role( Secondary -> Primary )
[ 127.324098] block drbd1: Handshake successful: Agreed network protocol version 91
[ 127.332124] block drbd1: conn( WFConnection -> WFReportParams )
[ 127.338691] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[ 127.368208] block drbd1: data-integrity-alg: <not-used>
[ 127.373997] block drbd1: drbd_sync_handshake:
[ 127.378829] block drbd1: self 3475127B7403BB0B:8D44B59FE2BB68BF:0E1B69D867AAE02D:2D27340FC037E077 bits:1979547 flags:0
[ 127.390381] block drbd1: peer 8D44B59FE2BB68BE:0000000000000000:0000000000000000:0000000000000000 bits:1978523 flags:0
[ 127.401931] block drbd1: uuid_compare()=1 by rule 70
[ 127.407382] block drbd1: Becoming sync source due to disk states.
[ 127.414065] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
[ 127.566844] block drbd2: Handshake successful: Agreed network protocol version 91
[ 127.575037] block drbd2: conn( WFConnection -> WFReportParams )
[ 127.581654] block drbd2: Starting asender thread (from drbd2_receiver [4738])
[ 127.589652] block drbd2: data-integrity-alg: <not-used>
[ 127.595445] block drbd2: drbd_sync_handshake:
[ 127.600247] block drbd2: self B89B0FA9261A88A7:13E00A97958E8BFD:D469C59FC5DBB2D0:B46B3E465F64FBCE bits:430137 flags:0
[ 127.611695] block drbd2: peer 13E00A97958E8BFC:0000000000000000:0000000000000000:0000000000000000 bits:427065 flags:0
[ 127.623370] block drbd2: uuid_compare()=1 by rule 70
[ 127.628795] block drbd2: Becoming sync source due to disk states.
[ 127.635429] block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
[ 127.671232] block drbd1: conn( WFBitMapS -> SyncSource )
[ 127.677439] block drbd1: Began resync as SyncSource (will sync 7918188 KB [1979547 bits set]).
[ 127.867559] block drbd2: conn( WFBitMapS -> SyncSource )
[ 127.873426] block drbd2: Began resync as SyncSource (will sync 1720548 KB [430137 bits set]).
[ 130.103616] JBD: barrier-based sync failed on drbd2-8 - disabling barriers
[ 142.482440] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[ 142.606298] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[ 148.486473] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[ 148.610322] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[ 154.491468] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[ 154.615339] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[ 160.495491] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[ 160.619364] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[ 166.499522] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[ 166.624422] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
[ 172.503544] block drbd2: drbd_send_block() failed
[ 172.508664] block drbd2: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 172.517517] block drbd2: drbd_pp_alloc interrupted!
[ 172.522827] block drbd2: alloc_ee: Allocation of a page failed
[ 172.529116] block drbd2: error receiving RSDataRequest, l: 24!
[ 172.535380] block drbd2: asender terminated
[ 172.539924] block drbd2: Terminating drbd2_asender
[ 172.540060] block drbd2: Connection closed
[ 172.540067] block drbd2: conn( NetworkFailure -> Unconnected )
[ 172.540073] block drbd2: receiver terminated
[ 172.540076] block drbd2: Restarting drbd2_receiver
[ 172.540080] block drbd2: receiver (re)started
[ 172.540086] block drbd2: conn( Unconnected -> WFConnection )
[ 172.629405] block drbd1: drbd_send_block() failed
[ 172.634522] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 172.643429] block drbd1: drbd_pp_alloc interrupted!
[ 172.648795] block drbd1: alloc_ee: Allocation of a page failed
[ 172.655137] block drbd1: error receiving RSDataRequest, l: 24!
[ 172.661566] block drbd1: asender terminated
[ 172.666273] block drbd1: Terminating drbd1_asender
[ 172.666366] block drbd1: Connection closed
[ 172.666375] block drbd1: conn( NetworkFailure -> Unconnected )
[ 172.666381] block drbd1: receiver terminated
[ 172.666385] block drbd1: Restarting drbd1_receiver
[ 172.666389] block drbd1: receiver (re)started
[ 172.666395] block drbd1: conn( Unconnected -> WFConnection )
[ 172.902451] block drbd2: Handshake successful: Agreed network protocol version 91
[ 172.910548] block drbd2: conn( WFConnection -> WFReportParams )
[ 172.917074] block drbd2: Starting asender thread (from drbd2_receiver [4738])
[ 172.924983] block drbd2: data-integrity-alg: <not-used>
[ 172.930761] block drbd2: drbd_sync_handshake:
[ 172.935531] block drbd2: self B89B0FA9261A88A7:EC151750C3343BBF:13E00A97958E8BFD:D469C59FC5DBB2D0 bits:423852 flags:0
[ 172.947034] block drbd2: peer EC151750C3343BBE:0000000000000000:0000000000000000:0000000000000000 bits:423841 flags:0
[ 172.958516] block drbd2: uuid_compare()=1 by rule 70
[ 172.963993] block drbd2: Becoming sync source due to disk states.
[ 172.970690] block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 172.992500] block drbd2: conn( WFBitMapS -> SyncSource )
[ 172.998432] block drbd2: Began resync as SyncSource (will sync 1695408 KB [423852 bits set]).
[ 173.058406] block drbd1: Handshake successful: Agreed network protocol version 91
[ 173.066554] block drbd1: conn( WFConnection -> WFReportParams )
[ 173.073052] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[ 173.081008] block drbd1: data-integrity-alg: <not-used>
[ 173.086807] block drbd1: drbd_sync_handshake:
[ 173.091627] block drbd1: self 3475127B7403BB0B:FD0FED39356C1BD2:8D44B59FE2BB68BF:0E1B69D867AAE02D bits:1977979 flags:0
[ 173.103125] block drbd1: peer FD0FED39356C1BD2:0000000000000000:0000000000000000:0000000000000000 bits:1977979 flags:0
[ 173.114608] block drbd1: uuid_compare()=1 by rule 70
[ 173.120079] block drbd1: Becoming sync source due to disk states.
[ 173.126763] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 173.169295] block drbd1: conn( WFBitMapS -> SyncSource )
[ 173.175262] block drbd1: Began resync as SyncSource (will sync 7911916 KB [1977979 bits set]).
[ 191.031348] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[ 194.232777] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[ 197.035373] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[ 200.236775] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[ 203.039389] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[ 206.240799] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[ 209.044437] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[ 212.244823] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[ 215.048434] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[ 218.248841] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
[ 221.052458] block drbd2: drbd_send_block() failed
[ 221.057638] block drbd2: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 221.066546] block drbd2: drbd_pp_alloc interrupted!
[ 221.071978] block drbd2: alloc_ee: Allocation of a page failed
[ 221.078315] block drbd2: error receiving RSDataRequest, l: 24!
[ 221.084632] block drbd2: asender terminated
[ 221.089256] block drbd2: Terminating drbd2_asender
[ 221.089381] block drbd2: Connection closed
[ 221.089389] block drbd2: conn( NetworkFailure -> Unconnected )
[ 221.089394] block drbd2: receiver terminated
[ 221.089396] block drbd2: Restarting drbd2_receiver
[ 221.089400] block drbd2: receiver (re)started
[ 221.089405] block drbd2: conn( Unconnected -> WFConnection )
[ 221.461213] block drbd2: Handshake successful: Agreed network protocol version 91
[ 221.469411] block drbd2: conn( WFConnection -> WFReportParams )
[ 221.476062] block drbd2: Starting asender thread (from drbd2_receiver [4738])
[ 221.483979] block drbd2: data-integrity-alg: <not-used>
[ 221.489887] block drbd2: drbd_sync_handshake:
[ 221.494746] block drbd2: self B89B0FA9261A88A7:91631364A53FB4F2:EC151750C3343BBF:13E00A97958E8BFD bits:423851 flags:0
[ 221.506290] block drbd2: peer 91631364A53FB4F2:0000000000000000:0000000000000000:0000000000000000 bits:423841 flags:0
[ 221.517838] block drbd2: uuid_compare()=1 by rule 70
[ 221.523408] block drbd2: Becoming sync source due to disk states.
[ 221.530182] block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 221.553601] block drbd2: conn( WFBitMapS -> SyncSource )
[ 221.559566] block drbd2: Began resync as SyncSource (will sync 1695404 KB [423851 bits set]).
[ 224.252896] block drbd1: drbd_send_block() failed
[ 224.258048] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 224.266887] block drbd1: drbd_pp_alloc interrupted!
[ 224.272236] block drbd1: alloc_ee: Allocation of a page failed
[ 224.278451] block drbd1: error receiving RSDataRequest, l: 24!
[ 224.284793] block drbd1: asender terminated
[ 224.289390] block drbd1: Terminating drbd1_asender
[ 224.289523] block drbd1: Connection closed
[ 224.289533] block drbd1: conn( NetworkFailure -> Unconnected )
[ 224.289540] block drbd1: receiver terminated
[ 224.289545] block drbd1: Restarting drbd1_receiver
[ 224.289551] block drbd1: receiver (re)started
[ 224.289558] block drbd1: conn( Unconnected -> WFConnection )
[ 224.659743] block drbd1: Handshake successful: Agreed network protocol version 91
[ 224.667906] block drbd1: conn( WFConnection -> WFReportParams )
[ 224.674475] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[ 224.682466] block drbd1: data-integrity-alg: <not-used>
[ 224.688127] block drbd1: drbd_sync_handshake:
[ 224.692940] block drbd1: self 3475127B7403BB0B:90A869D6EAB8005B:FD0FED39356C1BD2:8D44B59FE2BB68BF bits:1977979 flags:0
[ 224.704508] block drbd1: peer 90A869D6EAB8005A:0000000000000000:0000000000000000:0000000000000000 bits:1977979 flags:0
[ 224.716017] block drbd1: uuid_compare()=1 by rule 70
[ 224.721420] block drbd1: Becoming sync source due to disk states.
[ 224.728012] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 224.776680] block drbd1: conn( WFBitMapS -> SyncSource )
[ 224.782791] block drbd1: Began resync as SyncSource (will sync 7911916 KB [1977979 bits set]).
[ 230.192821] IPv6 addrconf: prefix with wrong length 126
[ 237.015598] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[ 239.595381] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[ 243.020626] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[ 245.599408] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[ 249.024669] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[ 251.604426] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[ 255.029664] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[ 257.609466] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[ 261.034700] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
[ 263.614471] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[ 267.039708] block drbd1: drbd_send_block() failed
[ 267.044821] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 267.053725] block drbd1: drbd_pp_alloc interrupted!
[ 267.059179] block drbd1: alloc_ee: Allocation of a page failed
[ 267.065563] block drbd1: error receiving RSDataRequest, l: 24!
[ 267.071947] block drbd1: asender terminated
[ 267.076520] block drbd1: Terminating drbd1_asender
[ 267.076606] block drbd1: Connection closed
[ 267.076614] block drbd1: conn( NetworkFailure -> Unconnected )
[ 267.076619] block drbd1: receiver terminated
[ 267.076622] block drbd1: Restarting drbd1_receiver
[ 267.076627] block drbd1: receiver (re)started
[ 267.076634] block drbd1: conn( Unconnected -> WFConnection )
[ 267.446395] block drbd1: Handshake successful: Agreed network protocol version 91
[ 267.454619] block drbd1: conn( WFConnection -> WFReportParams )
[ 267.461377] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[ 267.469379] block drbd1: data-integrity-alg: <not-used>
[ 267.475246] block drbd1: drbd_sync_handshake:
Is this a bug in DRBD? Other TCP traffic works fine.
Is this timeout configurable from DRBD side or is it TCP configuration?
I ran iperf and got TCP BW from Primary to Secondary to be 1.15Gbps and 420Mbps from Standby to Primary.
Any help is appreciate,
Thanks,
Jacob
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120814/10a79372/attachment.htm>