[DRBD-user] DRBD stalled connection / state mismatch between primary and secondary

Support WVNET hilfe at wvnet.at
Tue May 10 12:23:13 CEST 2011


Hi all,

I'm not able to re-sync my primary and secondary storage servers.
Both servers are identical; the setup looks like this:

-System: Slackware 13.1 64bit, kernel 2.6.33.12, OFED 1.5.2 stack, DRBD 8.3.10 from source
-Storage:
	Adaptec RAID controller 52445 (with BBU), 24x SAS
	RAID partitions
	drbd
	scst ib srpt_target (vdisk blockio)
-Replication-Link: IPoIB interface
-Cluster:
	Pacemaker 1.1.4
	Corosync 1.3.0
	2 communication links (1x crossover Gigabit Ethernet, 1x IPoIB link)
	

State on the secondary node : ( storage-node-b )
---------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-b.cluster.lokal, 2011-05-05 12:48:19

 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1175532 dw:1175532 dr:0 al:0 bm:8386 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0

10: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:480569816 dw:480569816 dr:0 al:0 bm:29452 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0

13: cs:Unconfigured

State on the primary node : ( storage-node-a )
------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-a.san.lokal, 2011-04-27 11:33:30

 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:516 nr:0 dw:2601253 dr:1083196 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:635196

10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:480359772 nr:0 dw:5831668 dr:483513740 al:0 bm:29323 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:210044
        [===================>] sync'ed:100.0% (204/469100)M
        finish: 3:41:03 speed: 12 (3,812) K/sec (stalled)
11: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:38456751 dr:62544271 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:16370960
12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:9660832 dr:7130755 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2684524
13: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Inconsistent C r-----
    ns:10052 nr:0 dw:14567958 dr:81131454 al:0 bm:3319 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:3465252


I found an older thread with the advice to disconnect and reconnect the
affected resource, so I tried this:

root@storage-node-a:~# drbdadm disconnect r_mailspace
root@storage-node-a:~# drbdadm connect r_mailspace
root@storage-node-a:~# cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-a.san.lokal, 2011-04-27 11:33:30

 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:516 nr:0 dw:2618682 dr:1095853 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:635364

10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:107416 nr:0 dw:5853702 dr:483621708 al:0 bm:29501 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:102656
        [=========>..........] sync'ed: 52.0% (102656/210044)K
        finish: 0:00:01 speed: 53,692 (53,692) K/sec
11: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:38527089 dr:62599136 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:16389488
12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:9669509 dr:7136228 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2684528
13: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Inconsistent C r-----
    ns:10052 nr:0 dw:14630185 dr:81146002 al:0 bm:3319 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:3466248

Logfiles from storage-node-a ( Primary ) for disconnecting/connecting
------------------------------------------------------------------------------------
[407119.085662] block drbd10: peer( Secondary -> Unknown ) conn( SyncSource -> Disconnecting )
[407119.107755] block drbd10: meta connection shut down by peer.
[407119.107761] block drbd10: asender terminated
[407119.107764] block drbd10: Terminating asender thread
[407119.459885] block drbd10: bitmap WRITE of 466 pages took 375 jiffies
[407119.598375] block drbd10: 205 MB (52511 bits) marked out-of-sync by on disk bit-map.
[407119.598391] block drbd10: Connection closed
[407119.598400] block drbd10: conn( Disconnecting -> StandAlone )
[407119.598431] block drbd10: receiver terminated
[407119.598434] block drbd10: Terminating receiver thread
[407123.476627] block drbd10: conn( StandAlone -> Unconnected )
[407123.476667] block drbd10: Starting receiver thread (from drbd10_worker [14471])
[407123.476708] block drbd10: receiver (re)started
[407123.476714] block drbd10: conn( Unconnected -> WFConnection )
[407123.577535] block drbd10: Handshake successful: Agreed network protocol version 96
[407123.577546] block drbd10: conn( WFConnection -> WFReportParams )
[407123.577647] block drbd10: Starting asender thread (from drbd10_receiver [24920])
[407123.577737] block drbd10: data-integrity-alg: <not-used>
[407123.577836] block drbd10: drbd_sync_handshake:
[407123.577841] block drbd10: self 0001000000000001:0002000000000000:0001000000000000:95D92A4B94991E58 bits:52511 flags:0
[407123.577845] block drbd10: peer 0001000000000000:0000000000000000:0002000000000000:0001000000000000 bits:0 flags:0
[407123.577849] block drbd10: was SyncSource, missed the resync finished event, corrected myself:
[407123.577854] block drbd10: self 0001000000000001:0000000000000000:0002000000000000:0001000000000000 bits:52511 flags:0
[407123.577857] block drbd10: uuid_compare()=1 by rule 34
[407123.577864] block drbd10: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Inconsistent -> Consistent )
[407124.012227] block drbd10: helper command: /sbin/drbdadm before-resync-source minor-10
[407124.014695] block drbd10: helper command: /sbin/drbdadm before-resync-source minor-10 exit code 0 (0x0)
[407124.014703] block drbd10: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
[407124.014711] block drbd10: Began resync as SyncSource (will sync 210044 KB [52511 bits set]).
[407124.014762] block drbd10: updated sync UUID 0001000000000001:0001000000000000:0002000000000000:0001000000000000
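
As a sanity check on the figures in these logs: the "52511 bits", "210044 KB" and "205 MB" values are consistent with each bitmap bit covering 4 KiB. The small Python sketch below only redoes that arithmetic; the constant and function name are my own, and the 4 KiB-per-bit granularity is an assumption inferred from the log numbers, not something read from the running system.

```python
# Assumption: one DRBD bitmap bit covers 4 KiB of the backing device.
# This is inferred from the log line "will sync 210044 KB [52511 bits set]".
KIB_PER_BIT = 4


def bits_to_kib(bits: int) -> int:
    """Convert a count of set bitmap bits to KiB of out-of-sync data."""
    return bits * KIB_PER_BIT


bits_set = 52511                 # from the drbd_sync_handshake log lines
kib = bits_to_kib(bits_set)

print(kib)           # KiB to resync, matches "will sync 210044 KB"
print(kib // 1024)   # whole MiB, matches "205 MB (52511 bits)"
```

So the 210044 KB shown as oos on the primary before the reconnect is exactly the amount the bitmap says is still out of sync.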

Logfiles from storage-node-b ( Secondary ) for disconnecting/connecting
------------------------------------------------------------------------------------
[408067.916086] block drbd10: peer( Primary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
[408067.938062] block drbd10: asender terminated
[408067.938067] block drbd10: Terminating asender thread
[408067.938290] block drbd10: Connection closed
[408067.938295] block drbd10: conn( TearDown -> Unconnected )
[408067.938301] block drbd10: receiver terminated
[408067.938303] block drbd10: Restarting receiver thread
[408067.938306] block drbd10: receiver (re)started
[408067.938311] block drbd10: conn( Unconnected -> WFConnection )
[408072.417315] block drbd10: Handshake successful: Agreed network protocol version 96
[408072.417324] block drbd10: conn( WFConnection -> WFReportParams )
[408072.417542] block drbd10: Starting asender thread (from drbd10_receiver [32234])
[408072.417720] block drbd10: data-integrity-alg: <not-used>
[408072.417769] block drbd10: drbd_sync_handshake:
[408072.417772] block drbd10: self 0001000000000000:0000000000000000:0002000000000000:0001000000000000 bits:0 flags:0
[408072.417776] block drbd10: peer 0001000000000001:0002000000000000:0001000000000000:95D92A4B94991E58 bits:52511 flags:2
[408072.417778] block drbd10: was SyncTarget, peer missed the resync finished event, corrected peer:
[408072.417781] block drbd10: peer 0001000000000001:0000000000000000:0002000000000000:0001000000000000 bits:52511 flags:2
[408072.417784] block drbd10: uuid_compare()=-1 by rule 35
[408072.417789] block drbd10: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
[408072.851104] block drbd10: conn( WFBitMapT -> WFSyncUUID )
[408073.154475] block drbd10: updated sync uuid 0001000000000000:0000000000000000:0002000000000000:0001000000000000
[408073.177311] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10
[408073.179989] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10 exit code 0 (0x0)
[408073.179998] block drbd10: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
[408073.180007] block drbd10: Began resync as SyncTarget (will sync 210044 KB [52511 bits set]).
[408077.968047] block drbd10: Resync done (total 4 sec; paused 0 sec; 52508 K/sec)
[408077.968054] block drbd10: updated UUIDs 0001000000000000:0000000000000000:0001000000000000:0002000000000000
[408077.968061] block drbd10: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
[408078.344564] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10
[408078.360396] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10 exit code 1 (0x100)
[408078.371319] block drbd10: bitmap WRITE of 3865 pages took 11 jiffies
[408078.619232] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.

DRBD State storage-node-b ( secondary ) after disconnect/connect
------------------------------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-b.cluster.lokal, 2011-05-05 12:48:19

 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1175532 dw:1175532 dr:0 al:0 bm:8386 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0

10: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:480569816 dw:480569816 dr:0 al:0 bm:29452 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0

13: cs:Unconfigured


After 40 minutes the state on the primary has hardly changed. The sync
speed is slowing down, while the secondary says everything is OK.
At the moment I have disabled all other resources on the secondary (so the
WFConnection state is expected for all the others).
I have also stopped the cluster stack on the secondary node to avoid a
switchover to data that may not be current.
I have tested the IPoIB link with netperf; throughput is about 8 Gbit/s,
which looks good.
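
In case it helps anyone compare the two views, here is a minimal Python sketch that pulls the connection state (cs) and out-of-sync counter (oos) per minor out of /proc/drbd text and lists the minors on which the two nodes disagree. The function names and the two sample strings are mine (the samples are shortened copies of the outputs in this mail), and the parsing only assumes the 8.3-style layout shown here; treat it as an illustration, not a tested tool.

```python
import re

# Start of a per-device entry, e.g. "10: cs:SyncSource ro:Primary/Secondary ..."
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")
# The out-of-sync counter (in KB) on the counters line of a device entry
OOS_RE = re.compile(r"\boos:(\d+)")


def parse_proc_drbd(text):
    """Return {minor: {"cs": str, "oos": int or None}} from /proc/drbd text."""
    devices = {}
    current = None
    for line in text.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            current = int(m.group(1))
            devices[current] = {"cs": m.group(2), "oos": None}
            continue
        m = OOS_RE.search(line)
        if m and current is not None:
            devices[current]["oos"] = int(m.group(1))
    return devices


def mismatches(local, peer):
    """Minors known to both nodes whose cs or oos values differ."""
    return sorted(m for m in local.keys() & peer.keys() if local[m] != peer[m])


# Shortened samples taken from the outputs in this mail (40 minutes after reconnect).
PRIMARY = """\
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:516 nr:0 dw:2676664 dr:1107571 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:636692
10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:210044 nr:0 dw:6011889 dr:483729658 al:0 bm:29688 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:111388
"""
SECONDARY = """\
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1175532 dw:1175532 dr:0 al:0 bm:8386 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0
10: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:480569816 dw:480569816 dr:0 al:0 bm:29452 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0
"""

primary = parse_proc_drbd(PRIMARY)
secondary = parse_proc_drbd(SECONDARY)
print(mismatches(primary, secondary))  # minors where the nodes disagree
```

With these samples both minor 1 and minor 10 are reported: minor 10 because the primary still thinks it is SyncSource, and minor 1 because the primary shows a non-zero oos although both sides say Connected and UpToDate.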


DRBD State storage-node-a ( primary ) : about 40 Minutes after disconnect/connect
-------------------------------------------------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-a.san.lokal, 2011-04-27 11:33:30

 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:516 nr:0 dw:2676664 dr:1107571 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:636692

10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:210044 nr:0 dw:6011889 dr:483729658 al:0 bm:29688 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:111388
        [========>...........] sync'ed: 48.1% (111388/210044)K
        finish: 0:55:25 speed: 32 (32) K/sec
11: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:39626755 dr:63686135 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:16514876
12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:9755091 dr:7177902 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2684572
13: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Inconsistent C r-----
    ns:10052 nr:0 dw:14946032 dr:81243128 al:0 bm:3319 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:3489076


DRBD Configuration
-----------------------
global {
        usage-count no;
}

common {
        protocol C;

        handlers {
                fence-peer "/usr/local/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/local/lib/drbd/crm-unfence-peer.sh";
                pri-on-incon-degr "echo b > /proc/sysrq-trigger";
                split-brain "/usr/local/lib/drbd/notify-split-brain.sh root";
        }

        disk {
                on-io-error detach;
                fencing resource-only;
                no-disk-barrier;
                no-disk-flushes;
                no-disk-drain;

        }

        syncer {
                rate 100M;
                al-extents 511;
                verify-alg sha1;
        }

        net {
            max-buffers 8000;
            max-epoch-size 8000;
            sndbuf-size 0;
            after-sb-0pri discard-zero-changes;
            after-sb-1pri discard-secondary;
            after-sb-2pri disconnect;
        }
}

resource r_mailspace {

    on storage-node-a {
        device          /dev/drbd10;
        disk            /dev/sdb1;
        address         10.212.13.1:7791;
        meta-disk       internal;
    }

    on storage-node-b {
        device          /dev/drbd10;
        disk            /dev/sdb1;
        address         10.212.13.2:7791;
        meta-disk       internal;
    }
}

Any hints or comments?

kind regards
Steve






