Hi,

I'm using drbd 8.3.7 on a 2.6.32 kernel. This is running in an embedded environment, in a card cage with 2 cards. The cards are connected with an internal 10G MAC.

My drbd configuration is:

global {
    usage-count no;
}
common {
    protocol C;
    syncer {
        rate 10M;
    }
    net {
        ko-count 6;
    }
    handlers {
        # see also /etc/drbd.d/global_common.conf
        split-brain "/opt/compass/bin/drbd-notify-split-brain.sh";
    }
}
resource home {
    meta-disk internal;
    device /dev/drbd1;
    disk /dev/ssd2;
    on cpm0 {
        address 1.1.1.129:7788;
    }
    on cpm1 {
        address 1.1.1.130:7788;
    }
}
resource syslog {
    meta-disk internal;
    device /dev/drbd2;
    disk /dev/ssd3;
    on cpm0 {
        address 1.1.1.129:7789;
    }
    on cpm1 {
        address 1.1.1.130:7789;
    }
}

As you can see, my syncer rate is 10M, which is not much for a 10G link. BTW, tcpdump doesn't show much traffic other than DRBD.

When I get into a situation that requires a re-sync, I keep getting sock_sendmsg timeouts, and the situation never heals. Here's the console output on the primary node:

[ 116.784427] block drbd1: Starting worker thread (from cqueue [4659])
[ 116.803430] block drbd1: disk( Diskless -> Attaching )
[ 116.813328] block drbd1: Found 4 transactions (4 active extents) in activity log.
[ 116.825358] block drbd1: Method to ensure write ordering: barrier
[ 116.832006] block drbd1: max_segment_size ( = BIO size ) = 32768
[ 116.838643] block drbd1: drbd_bm_resize called with capacity == 88081624
[ 116.846592] block drbd1: resync bitmap: bits=11010203 words=172035
[ 116.853419] block drbd1: size = 42 GB (44040812 KB)
[ 116.877240] block drbd1: recounting of set bits took additional 3 jiffies
[ 116.885609] block drbd1: 7729 MB (1978523 bits) marked out-of-sync by on disk bit-map.
[ 116.894682] block drbd1: Marked additional 4096 KB as out-of-sync based on AL.
[ 116.903905] block drbd1: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
[ 117.019539] block drbd1: conn( StandAlone -> Unconnected )
[ 117.025479] block drbd1: Starting receiver thread (from drbd1_worker [4664])
[ 117.057752] block drbd1: receiver (re)started
[ 117.062463] block drbd1: conn( Unconnected -> WFConnection )
[ 117.147161] block drbd2: Starting worker thread (from cqueue [4659])
[ 117.154200] block drbd2: disk( Diskless -> Attaching )
[ 117.161682] block drbd2: Found 4 transactions (8 active extents) in activity log.
[ 117.169835] block drbd2: Method to ensure write ordering: barrier
[ 117.176383] block drbd2: max_segment_size ( = BIO size ) = 32768
[ 117.182833] block drbd2: drbd_bm_resize called with capacity == 20980168
[ 117.190127] block drbd2: resync bitmap: bits=2622521 words=40977
[ 117.196635] block drbd2: size = 10 GB (10490084 KB)
[ 117.206486] block drbd2: recounting of set bits took additional 1 jiffies
[ 117.214012] block drbd2: 1668 MB (427065 bits) marked out-of-sync by on disk bit-map.
[ 117.222494] block drbd2: Marked additional 12 MB as out-of-sync based on AL.
[ 117.230980] block drbd2: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
[ 117.313623] block drbd2: conn( StandAlone -> Unconnected )
[ 117.320820] block drbd2: Starting receiver thread (from drbd2_worker [4710])
[ 117.328676] block drbd2: receiver (re)started
[ 117.333448] block drbd2: conn( Unconnected -> WFConnection )
[ 117.813264] block drbd2: role( Secondary -> Primary )
[ 127.324098] block drbd1: Handshake successful: Agreed network protocol version 91
[ 127.332124] block drbd1: conn( WFConnection -> WFReportParams )
[ 127.338691] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[ 127.368208] block drbd1: data-integrity-alg: <not-used>
[ 127.373997] block drbd1: drbd_sync_handshake:
[ 127.378829] block drbd1: self 3475127B7403BB0B:8D44B59FE2BB68BF:0E1B69D867AAE02D:2D27340FC037E077 bits:1979547 flags:0
[ 127.390381] block drbd1: peer 8D44B59FE2BB68BE:0000000000000000:0000000000000000:0000000000000000 bits:1978523 flags:0
[ 127.401931] block drbd1: uuid_compare()=1 by rule 70
[ 127.407382] block drbd1: Becoming sync source due to disk states.
[ 127.414065] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
[ 127.566844] block drbd2: Handshake successful: Agreed network protocol version 91
[ 127.575037] block drbd2: conn( WFConnection -> WFReportParams )
[ 127.581654] block drbd2: Starting asender thread (from drbd2_receiver [4738])
[ 127.589652] block drbd2: data-integrity-alg: <not-used>
[ 127.595445] block drbd2: drbd_sync_handshake:
[ 127.600247] block drbd2: self B89B0FA9261A88A7:13E00A97958E8BFD:D469C59FC5DBB2D0:B46B3E465F64FBCE bits:430137 flags:0
[ 127.611695] block drbd2: peer 13E00A97958E8BFC:0000000000000000:0000000000000000:0000000000000000 bits:427065 flags:0
[ 127.623370] block drbd2: uuid_compare()=1 by rule 70
[ 127.628795] block drbd2: Becoming sync source due to disk states.
[ 127.635429] block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
[ 127.671232] block drbd1: conn( WFBitMapS -> SyncSource )
[ 127.677439] block drbd1: Began resync as SyncSource (will sync 7918188 KB [1979547 bits set]).
[ 127.867559] block drbd2: conn( WFBitMapS -> SyncSource )
[ 127.873426] block drbd2: Began resync as SyncSource (will sync 1720548 KB [430137 bits set]).
[ 130.103616] JBD: barrier-based sync failed on drbd2-8 - disabling barriers
[ 142.482440] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[ 142.606298] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[ 148.486473] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[ 148.610322] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[ 154.491468] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[ 154.615339] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[ 160.495491] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[ 160.619364] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[ 166.499522] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[ 166.624422] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
[ 172.503544] block drbd2: drbd_send_block() failed
[ 172.508664] block drbd2: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 172.517517] block drbd2: drbd_pp_alloc interrupted!
[ 172.522827] block drbd2: alloc_ee: Allocation of a page failed
[ 172.529116] block drbd2: error receiving RSDataRequest, l: 24!
[ 172.535380] block drbd2: asender terminated
[ 172.539924] block drbd2: Terminating drbd2_asender
[ 172.540060] block drbd2: Connection closed
[ 172.540067] block drbd2: conn( NetworkFailure -> Unconnected )
[ 172.540073] block drbd2: receiver terminated
[ 172.540076] block drbd2: Restarting drbd2_receiver
[ 172.540080] block drbd2: receiver (re)started
[ 172.540086] block drbd2: conn( Unconnected -> WFConnection )
[ 172.629405] block drbd1: drbd_send_block() failed
[ 172.634522] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 172.643429] block drbd1: drbd_pp_alloc interrupted!
[ 172.648795] block drbd1: alloc_ee: Allocation of a page failed
[ 172.655137] block drbd1: error receiving RSDataRequest, l: 24!
[ 172.661566] block drbd1: asender terminated
[ 172.666273] block drbd1: Terminating drbd1_asender
[ 172.666366] block drbd1: Connection closed
[ 172.666375] block drbd1: conn( NetworkFailure -> Unconnected )
[ 172.666381] block drbd1: receiver terminated
[ 172.666385] block drbd1: Restarting drbd1_receiver
[ 172.666389] block drbd1: receiver (re)started
[ 172.666395] block drbd1: conn( Unconnected -> WFConnection )
[ 172.902451] block drbd2: Handshake successful: Agreed network protocol version 91
[ 172.910548] block drbd2: conn( WFConnection -> WFReportParams )
[ 172.917074] block drbd2: Starting asender thread (from drbd2_receiver [4738])
[ 172.924983] block drbd2: data-integrity-alg: <not-used>
[ 172.930761] block drbd2: drbd_sync_handshake:
[ 172.935531] block drbd2: self B89B0FA9261A88A7:EC151750C3343BBF:13E00A97958E8BFD:D469C59FC5DBB2D0 bits:423852 flags:0
[ 172.947034] block drbd2: peer EC151750C3343BBE:0000000000000000:0000000000000000:0000000000000000 bits:423841 flags:0
[ 172.958516] block drbd2: uuid_compare()=1 by rule 70
[ 172.963993] block drbd2: Becoming sync source due to disk states.
[ 172.970690] block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 172.992500] block drbd2: conn( WFBitMapS -> SyncSource )
[ 172.998432] block drbd2: Began resync as SyncSource (will sync 1695408 KB [423852 bits set]).
[ 173.058406] block drbd1: Handshake successful: Agreed network protocol version 91
[ 173.066554] block drbd1: conn( WFConnection -> WFReportParams )
[ 173.073052] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[ 173.081008] block drbd1: data-integrity-alg: <not-used>
[ 173.086807] block drbd1: drbd_sync_handshake:
[ 173.091627] block drbd1: self 3475127B7403BB0B:FD0FED39356C1BD2:8D44B59FE2BB68BF:0E1B69D867AAE02D bits:1977979 flags:0
[ 173.103125] block drbd1: peer FD0FED39356C1BD2:0000000000000000:0000000000000000:0000000000000000 bits:1977979 flags:0
[ 173.114608] block drbd1: uuid_compare()=1 by rule 70
[ 173.120079] block drbd1: Becoming sync source due to disk states.
[ 173.126763] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 173.169295] block drbd1: conn( WFBitMapS -> SyncSource )
[ 173.175262] block drbd1: Began resync as SyncSource (will sync 7911916 KB [1977979 bits set]).
[ 191.031348] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[ 194.232777] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[ 197.035373] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[ 200.236775] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[ 203.039389] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[ 206.240799] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[ 209.044437] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[ 212.244823] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[ 215.048434] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[ 218.248841] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
[ 221.052458] block drbd2: drbd_send_block() failed
[ 221.057638] block drbd2: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 221.066546] block drbd2: drbd_pp_alloc interrupted!
[ 221.071978] block drbd2: alloc_ee: Allocation of a page failed
[ 221.078315] block drbd2: error receiving RSDataRequest, l: 24!
[ 221.084632] block drbd2: asender terminated
[ 221.089256] block drbd2: Terminating drbd2_asender
[ 221.089381] block drbd2: Connection closed
[ 221.089389] block drbd2: conn( NetworkFailure -> Unconnected )
[ 221.089394] block drbd2: receiver terminated
[ 221.089396] block drbd2: Restarting drbd2_receiver
[ 221.089400] block drbd2: receiver (re)started
[ 221.089405] block drbd2: conn( Unconnected -> WFConnection )
[ 221.461213] block drbd2: Handshake successful: Agreed network protocol version 91
[ 221.469411] block drbd2: conn( WFConnection -> WFReportParams )
[ 221.476062] block drbd2: Starting asender thread (from drbd2_receiver [4738])
[ 221.483979] block drbd2: data-integrity-alg: <not-used>
[ 221.489887] block drbd2: drbd_sync_handshake:
[ 221.494746] block drbd2: self B89B0FA9261A88A7:91631364A53FB4F2:EC151750C3343BBF:13E00A97958E8BFD bits:423851 flags:0
[ 221.506290] block drbd2: peer 91631364A53FB4F2:0000000000000000:0000000000000000:0000000000000000 bits:423841 flags:0
[ 221.517838] block drbd2: uuid_compare()=1 by rule 70
[ 221.523408] block drbd2: Becoming sync source due to disk states.
[ 221.530182] block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 221.553601] block drbd2: conn( WFBitMapS -> SyncSource )
[ 221.559566] block drbd2: Began resync as SyncSource (will sync 1695404 KB [423851 bits set]).
[ 224.252896] block drbd1: drbd_send_block() failed
[ 224.258048] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 224.266887] block drbd1: drbd_pp_alloc interrupted!
[ 224.272236] block drbd1: alloc_ee: Allocation of a page failed
[ 224.278451] block drbd1: error receiving RSDataRequest, l: 24!
[ 224.284793] block drbd1: asender terminated
[ 224.289390] block drbd1: Terminating drbd1_asender
[ 224.289523] block drbd1: Connection closed
[ 224.289533] block drbd1: conn( NetworkFailure -> Unconnected )
[ 224.289540] block drbd1: receiver terminated
[ 224.289545] block drbd1: Restarting drbd1_receiver
[ 224.289551] block drbd1: receiver (re)started
[ 224.289558] block drbd1: conn( Unconnected -> WFConnection )
[ 224.659743] block drbd1: Handshake successful: Agreed network protocol version 91
[ 224.667906] block drbd1: conn( WFConnection -> WFReportParams )
[ 224.674475] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[ 224.682466] block drbd1: data-integrity-alg: <not-used>
[ 224.688127] block drbd1: drbd_sync_handshake:
[ 224.692940] block drbd1: self 3475127B7403BB0B:90A869D6EAB8005B:FD0FED39356C1BD2:8D44B59FE2BB68BF bits:1977979 flags:0
[ 224.704508] block drbd1: peer 90A869D6EAB8005A:0000000000000000:0000000000000000:0000000000000000 bits:1977979 flags:0
[ 224.716017] block drbd1: uuid_compare()=1 by rule 70
[ 224.721420] block drbd1: Becoming sync source due to disk states.
[ 224.728012] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[ 224.776680] block drbd1: conn( WFBitMapS -> SyncSource )
[ 224.782791] block drbd1: Began resync as SyncSource (will sync 7911916 KB [1977979 bits set]).
[ 230.192821] IPv6 addrconf: prefix with wrong length 126
[ 237.015598] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[ 239.595381] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[ 243.020626] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[ 245.599408] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[ 249.024669] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[ 251.604426] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[ 255.029664] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[ 257.609466] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[ 261.034700] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
[ 263.614471] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[ 267.039708] block drbd1: drbd_send_block() failed
[ 267.044821] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[ 267.053725] block drbd1: drbd_pp_alloc interrupted!
[ 267.059179] block drbd1: alloc_ee: Allocation of a page failed
[ 267.065563] block drbd1: error receiving RSDataRequest, l: 24!
[ 267.071947] block drbd1: asender terminated
[ 267.076520] block drbd1: Terminating drbd1_asender
[ 267.076606] block drbd1: Connection closed
[ 267.076614] block drbd1: conn( NetworkFailure -> Unconnected )
[ 267.076619] block drbd1: receiver terminated
[ 267.076622] block drbd1: Restarting drbd1_receiver
[ 267.076627] block drbd1: receiver (re)started
[ 267.076634] block drbd1: conn( Unconnected -> WFConnection )
[ 267.446395] block drbd1: Handshake successful: Agreed network protocol version 91
[ 267.454619] block drbd1: conn( WFConnection -> WFReportParams )
[ 267.461377] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[ 267.469379] block drbd1: data-integrity-alg: <not-used>
[ 267.475246] block drbd1: drbd_sync_handshake:

Is this a bug in DRBD?
Other TCP traffic works fine. Is this timeout configurable from the DRBD side, or is it a TCP setting? I ran iperf and got a TCP bandwidth of 1.15 Gbps from Primary to Secondary and 420 Mbps from Standby to Primary.

Any help is appreciated.

Thanks,
Jacob
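
One way to read the log above: for each device, the "sock_sendmsg time expired" messages arrive roughly 6 seconds apart while ko counts down from 5 to 1, after which the connection is dropped and re-established. The minimal Python sketch below measures that spacing and checks that the reported resync sizes correspond to 4 KB per bitmap bit. The function names (expiry_intervals, bits_to_kb), the regular expression, and the embedded sample (copied from the first countdown in the log) are illustrative assumptions, not part of DRBD. The 4 KB-per-bit figure is inferred from the log's own numbers. The 6 second spacing would be consistent with a 6 second timeout in the net section, but the posted configuration does not set timeout, so that is an inference to check against the drbd.conf man page for 8.3.

```python
import re
from collections import defaultdict

# Matches entries such as:
# [ 142.482440] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
EXPIRED = re.compile(
    r"\[\s*(\d+\.\d+)\] block (drbd\d+): \[[^\]]*\] "
    r"sock_sendmsg time expired, ko = (\d+)"
)

# Sample copied from the first ko countdown in the posted log.
SAMPLE_LOG = """
[ 142.482440] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[ 142.606298] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[ 148.486473] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[ 148.610322] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[ 154.491468] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[ 154.615339] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[ 160.495491] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[ 160.619364] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[ 166.499522] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[ 166.624422] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
"""


def expiry_intervals(log_text):
    """Return {device: [seconds between consecutive ko decrements]}.

    A delta is recorded only when ko drops by exactly one, so a fresh
    countdown after a reconnect does not produce a bogus interval.
    """
    last = {}  # device -> (timestamp, ko) of the previous expiry message
    intervals = defaultdict(list)
    for match in EXPIRED.finditer(log_text):
        timestamp = float(match.group(1))
        device = match.group(2)
        ko = int(match.group(3))
        previous = last.get(device)
        if previous is not None and previous[1] - 1 == ko:
            intervals[device].append(timestamp - previous[0])
        last[device] = (timestamp, ko)
    return dict(intervals)


def bits_to_kb(bits, kb_per_bit=4):
    """Convert out-of-sync bitmap bits to KB.

    kb_per_bit=4 is an assumption inferred from the log line
    "will sync 7918188 KB [1979547 bits set]".
    """
    return bits * kb_per_bit


if __name__ == "__main__":
    for device, deltas in sorted(expiry_intervals(SAMPLE_LOG).items()):
        formatted = ", ".join(f"{delta:.3f}" for delta in deltas)
        print(f"{device}: seconds between expiries: {formatted}")
    print("drbd1 resync size in KB:", bits_to_kb(1979547))
    print("drbd2 resync size in KB:", bits_to_kb(430137))
```

The regular expression uses finditer over the whole text, so the same function works whether the log has one entry per line or has been flattened into long lines as in some archive copies.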