[DRBD-user] sock_sendmsg time expired

Jacob Avraham jacob at compass-eos.com
Tue Aug 14 21:20:46 CEST 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,
I'm using drbd 8.3.7 on a 2.6.32 kernel .
This is running in a embedded environment in a card cage with 2 cards.
The cards are connected with an internal 10G MAC.

My drbd configuration is:

global {
    usage-count no;
}

common {
    protocol C;
    syncer {
        rate 10M;
    }
        net {
               ko-count  6;
    }
    handlers {
        # see also /etc/drbd.d/global_common.conf
        split-brain "/opt/compass/bin/drbd-notify-split-brain.sh";
    }
}

resource home {
    meta-disk internal;
    device /dev/drbd1;
    disk /dev/ssd2;
    on cpm0 {
        address 1.1.1.129:7788;
    }
    on cpm1 {
        address 1.1.1.130:7788;
    }
}

resource syslog {
    meta-disk internal;
    device /dev/drbd2;
    disk /dev/ssd3;
    on cpm0 {
        address 1.1.1.129:7789;
    }
    on cpm1 {
        address 1.1.1.130:7789;
    }
}

As you see my syncer rate is 10M, which is not too much for a 10G link.
BTW, tcpdump doesn't show much traffic other than DRBD.

When I get into a situation that requires a re-sync I keep getting sock_sendmsg timeouts, and the situation never heals.
Here's the console output on the primary node:

[  116.784427] block drbd1: Starting worker thread (from cqueue [4659])
[  116.803430] block drbd1: disk( Diskless -> Attaching )
[  116.813328] block drbd1: Found 4 transactions (4 active extents) in activity log.
[  116.825358] block drbd1: Method to ensure write ordering: barrier
[  116.832006] block drbd1: max_segment_size ( = BIO size ) = 32768
[  116.838643] block drbd1: drbd_bm_resize called with capacity == 88081624
[  116.846592] block drbd1: resync bitmap: bits=11010203 words=172035
[  116.853419] block drbd1: size = 42 GB (44040812 KB)
[  116.877240] block drbd1: recounting of set bits took additional 3 jiffies
[  116.885609] block drbd1: 7729 MB (1978523 bits) marked out-of-sync by on disk bit-map.
[  116.894682] block drbd1: Marked additional 4096 KB as out-of-sync based on AL.
[  116.903905] block drbd1: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
[  117.019539] block drbd1: conn( StandAlone -> Unconnected )
[  117.025479] block drbd1: Starting receiver thread (from drbd1_worker [4664])
 [  117.057752] block drbd1: receiver (re)started
[  117.062463] block drbd1: conn( Unconnected -> WFConnection )
[  117.147161] block drbd2: Starting worker thread (from cqueue [4659])
[  117.154200] block drbd2: disk( Diskless -> Attaching )
[  117.161682] block drbd2: Found 4 transactions (8 active extents) in activity log.
[  117.169835] block drbd2: Method to ensure write ordering: barrier
[  117.176383] block drbd2: max_segment_size ( = BIO size ) = 32768
[  117.182833] block drbd2: drbd_bm_resize called with capacity == 20980168
[  117.190127] block drbd2: resync bitmap: bits=2622521 words=40977
[  117.196635] block drbd2: size = 10 GB (10490084 KB)
[  117.206486] block drbd2: recounting of set bits took additional 1 jiffies
[  117.214012] block drbd2: 1668 MB (427065 bits) marked out-of-sync by on disk bit-map.
[  117.222494] block drbd2: Marked additional 12 MB as out-of-sync based on AL.
[  117.230980] block drbd2: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
[  117.313623] block drbd2: conn( StandAlone -> Unconnected )
[  117.320820] block drbd2: Starting receiver thread (from drbd2_worker [4710])
[  117.328676] block drbd2: receiver (re)started
[  117.333448] block drbd2: conn( Unconnected -> WFConnection )
[  117.813264] block drbd2: role( Secondary -> Primary )
 [  127.324098] block drbd1: Handshake successful: Agreed network protocol version 91
[  127.332124] block drbd1: conn( WFConnection -> WFReportParams )
[  127.338691] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[  127.368208] block drbd1: data-integrity-alg: <not-used>
[  127.373997] block drbd1: drbd_sync_handshake:
[  127.378829] block drbd1: self 3475127B7403BB0B:8D44B59FE2BB68BF:0E1B69D867AAE02D:2D27340FC037E077 bits:1979547 flags:0
[  127.390381] block drbd1: peer 8D44B59FE2BB68BE:0000000000000000:0000000000000000:0000000000000000 bits:1978523 flags:0
[  127.401931] block drbd1: uuid_compare()=1 by rule 70
[  127.407382] block drbd1: Becoming sync source due to disk states.
[  127.414065] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
[  127.566844] block drbd2: Handshake successful: Agreed network protocol version 91
[  127.575037] block drbd2: conn( WFConnection -> WFReportParams )
[  127.581654] block drbd2: Starting asender thread (from drbd2_receiver [4738])
[  127.589652] block drbd2: data-integrity-alg: <not-used>
[  127.595445] block drbd2: drbd_sync_handshake:
[  127.600247] block drbd2: self B89B0FA9261A88A7:13E00A97958E8BFD:D469C59FC5DBB2D0:B46B3E465F64FBCE bits:430137 flags:0
[  127.611695] block drbd2: peer 13E00A97958E8BFC:0000000000000000:0000000000000000:0000000000000000 bits:427065 flags:0
[  127.623370] block drbd2: uuid_compare()=1 by rule 70
[  127.628795] block drbd2: Becoming sync source due to disk states.
[  127.635429] block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
[  127.671232] block drbd1: conn( WFBitMapS -> SyncSource )
[  127.677439] block drbd1: Began resync as SyncSource (will sync 7918188 KB [1979547 bits set]).
[  127.867559] block drbd2: conn( WFBitMapS -> SyncSource )
[  127.873426] block drbd2: Began resync as SyncSource (will sync 1720548 KB [430137 bits set]).
[  130.103616] JBD: barrier-based sync failed on drbd2-8 - disabling barriers
[  142.482440] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[  142.606298] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[  148.486473] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[  148.610322] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[  154.491468] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[  154.615339] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[  160.495491] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[  160.619364] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[  166.499522] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[  166.624422] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
[  172.503544] block drbd2: drbd_send_block() failed
[  172.508664] block drbd2: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[  172.517517] block drbd2: drbd_pp_alloc interrupted!
[  172.522827] block drbd2: alloc_ee: Allocation of a page failed
[  172.529116] block drbd2: error receiving RSDataRequest, l: 24!
[  172.535380] block drbd2: asender terminated
[  172.539924] block drbd2: Terminating drbd2_asender
[  172.540060] block drbd2: Connection closed
[  172.540067] block drbd2: conn( NetworkFailure -> Unconnected )
[  172.540073] block drbd2: receiver terminated
[  172.540076] block drbd2: Restarting drbd2_receiver
[  172.540080] block drbd2: receiver (re)started
[  172.540086] block drbd2: conn( Unconnected -> WFConnection )
[  172.629405] block drbd1: drbd_send_block() failed
[  172.634522] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[  172.643429] block drbd1: drbd_pp_alloc interrupted!
[  172.648795] block drbd1: alloc_ee: Allocation of a page failed
[  172.655137] block drbd1: error receiving RSDataRequest, l: 24!
[  172.661566] block drbd1: asender terminated
[  172.666273] block drbd1: Terminating drbd1_asender
[  172.666366] block drbd1: Connection closed
[  172.666375] block drbd1: conn( NetworkFailure -> Unconnected )
[  172.666381] block drbd1: receiver terminated
[  172.666385] block drbd1: Restarting drbd1_receiver
[  172.666389] block drbd1: receiver (re)started
[  172.666395] block drbd1: conn( Unconnected -> WFConnection )
[  172.902451] block drbd2: Handshake successful: Agreed network protocol version 91
[  172.910548] block drbd2: conn( WFConnection -> WFReportParams )
[  172.917074] block drbd2: Starting asender thread (from drbd2_receiver [4738])
[  172.924983] block drbd2: data-integrity-alg: <not-used>
[  172.930761] block drbd2: drbd_sync_handshake:
[  172.935531] block drbd2: self B89B0FA9261A88A7:EC151750C3343BBF:13E00A97958E8BFD:D469C59FC5DBB2D0 bits:423852 flags:0
[  172.947034] block drbd2: peer EC151750C3343BBE:0000000000000000:0000000000000000:0000000000000000 bits:423841 flags:0
[  172.958516] block drbd2: uuid_compare()=1 by rule 70
[  172.963993] block drbd2: Becoming sync source due to disk states.
[  172.970690] block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[  172.992500] block drbd2: conn( WFBitMapS -> SyncSource )
[  172.998432] block drbd2: Began resync as SyncSource (will sync 1695408 KB [423852 bits set]).
[  173.058406] block drbd1: Handshake successful: Agreed network protocol version 91
[  173.066554] block drbd1: conn( WFConnection -> WFReportParams )
[  173.073052] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[  173.081008] block drbd1: data-integrity-alg: <not-used>
[  173.086807] block drbd1: drbd_sync_handshake:
[  173.091627] block drbd1: self 3475127B7403BB0B:FD0FED39356C1BD2:8D44B59FE2BB68BF:0E1B69D867AAE02D bits:1977979 flags:0
[  173.103125] block drbd1: peer FD0FED39356C1BD2:0000000000000000:0000000000000000:0000000000000000 bits:1977979 flags:0
[  173.114608] block drbd1: uuid_compare()=1 by rule 70
[  173.120079] block drbd1: Becoming sync source due to disk states.
[  173.126763] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[  173.169295] block drbd1: conn( WFBitMapS -> SyncSource )
[  173.175262] block drbd1: Began resync as SyncSource (will sync 7911916 KB [1977979 bits set]).
[  191.031348] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[  194.232777] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[  197.035373] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[  200.236775] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[  203.039389] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[  206.240799] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[  209.044437] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[  212.244823] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[  215.048434] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[  218.248841] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
[  221.052458] block drbd2: drbd_send_block() failed
[  221.057638] block drbd2: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[  221.066546] block drbd2: drbd_pp_alloc interrupted!
[  221.071978] block drbd2: alloc_ee: Allocation of a page failed
[  221.078315] block drbd2: error receiving RSDataRequest, l: 24!
[  221.084632] block drbd2: asender terminated
[  221.089256] block drbd2: Terminating drbd2_asender
[  221.089381] block drbd2: Connection closed
[  221.089389] block drbd2: conn( NetworkFailure -> Unconnected )
[  221.089394] block drbd2: receiver terminated
[  221.089396] block drbd2: Restarting drbd2_receiver
[  221.089400] block drbd2: receiver (re)started
[  221.089405] block drbd2: conn( Unconnected -> WFConnection )
[  221.461213] block drbd2: Handshake successful: Agreed network protocol version 91
[  221.469411] block drbd2: conn( WFConnection -> WFReportParams )
[  221.476062] block drbd2: Starting asender thread (from drbd2_receiver [4738])
[  221.483979] block drbd2: data-integrity-alg: <not-used>
[  221.489887] block drbd2: drbd_sync_handshake:
[  221.494746] block drbd2: self B89B0FA9261A88A7:91631364A53FB4F2:EC151750C3343BBF:13E00A97958E8BFD bits:423851 flags:0
[  221.506290] block drbd2: peer 91631364A53FB4F2:0000000000000000:0000000000000000:0000000000000000 bits:423841 flags:0
[  221.517838] block drbd2: uuid_compare()=1 by rule 70
[  221.523408] block drbd2: Becoming sync source due to disk states.
[  221.530182] block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[  221.553601] block drbd2: conn( WFBitMapS -> SyncSource )
[  221.559566] block drbd2: Began resync as SyncSource (will sync 1695404 KB [423851 bits set]).
[  224.252896] block drbd1: drbd_send_block() failed
[  224.258048] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[  224.266887] block drbd1: drbd_pp_alloc interrupted!
[  224.272236] block drbd1: alloc_ee: Allocation of a page failed
[  224.278451] block drbd1: error receiving RSDataRequest, l: 24!
[  224.284793] block drbd1: asender terminated
[  224.289390] block drbd1: Terminating drbd1_asender
[  224.289523] block drbd1: Connection closed
[  224.289533] block drbd1: conn( NetworkFailure -> Unconnected )
[  224.289540] block drbd1: receiver terminated
[  224.289545] block drbd1: Restarting drbd1_receiver
[  224.289551] block drbd1: receiver (re)started
[  224.289558] block drbd1: conn( Unconnected -> WFConnection )
[  224.659743] block drbd1: Handshake successful: Agreed network protocol version 91
[  224.667906] block drbd1: conn( WFConnection -> WFReportParams )
[  224.674475] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[  224.682466] block drbd1: data-integrity-alg: <not-used>
[  224.688127] block drbd1: drbd_sync_handshake:
[  224.692940] block drbd1: self 3475127B7403BB0B:90A869D6EAB8005B:FD0FED39356C1BD2:8D44B59FE2BB68BF bits:1977979 flags:0
[  224.704508] block drbd1: peer 90A869D6EAB8005A:0000000000000000:0000000000000000:0000000000000000 bits:1977979 flags:0
[  224.716017] block drbd1: uuid_compare()=1 by rule 70
[  224.721420] block drbd1: Becoming sync source due to disk states.
[  224.728012] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
[  224.776680] block drbd1: conn( WFBitMapS -> SyncSource )
[  224.782791] block drbd1: Began resync as SyncSource (will sync 7911916 KB [1977979 bits set]).
[  230.192821] IPv6 addrconf: prefix with wrong length 126
[  237.015598] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 5
[  239.595381] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 5
[  243.020626] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 4
[  245.599408] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 4
[  249.024669] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 3
[  251.604426] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 3
[  255.029664] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 2
[  257.609466] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 2
[  261.034700] block drbd1: [drbd1_worker/4664] sock_sendmsg time expired, ko = 1
[  263.614471] block drbd2: [drbd2_worker/4710] sock_sendmsg time expired, ko = 1
[  267.039708] block drbd1: drbd_send_block() failed
[  267.044821] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
[  267.053725] block drbd1: drbd_pp_alloc interrupted!
[  267.059179] block drbd1: alloc_ee: Allocation of a page failed
[  267.065563] block drbd1: error receiving RSDataRequest, l: 24!
[  267.071947] block drbd1: asender terminated
[  267.076520] block drbd1: Terminating drbd1_asender
[  267.076606] block drbd1: Connection closed
[  267.076614] block drbd1: conn( NetworkFailure -> Unconnected )
[  267.076619] block drbd1: receiver terminated
[  267.076622] block drbd1: Restarting drbd1_receiver
[  267.076627] block drbd1: receiver (re)started
[  267.076634] block drbd1: conn( Unconnected -> WFConnection )
[  267.446395] block drbd1: Handshake successful: Agreed network protocol version 91
[  267.454619] block drbd1: conn( WFConnection -> WFReportParams )
[  267.461377] block drbd1: Starting asender thread (from drbd1_receiver [4692])
[  267.469379] block drbd1: data-integrity-alg: <not-used>
[  267.475246] block drbd1: drbd_sync_handshake:


Is this a bug in DRBD? Other TCP traffic works fine.
Is this timeout configurable from DRBD side or is it TCP configuration?
I ran iperf and got TCP BW from Primary to Secondary to be 1.15Gbps and 420Mbps from Standby to Primary.

Any help is appreciate,

Thanks,
Jacob

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120814/10a79372/attachment.htm>


More information about the drbd-user mailing list