Hi all,
I'm not able to re-sync my primary and secondary storage servers.
Both servers are identical. The setup looks like this:
-System: Slackware 13.1 64bit, kernel 2.6.33.12, OFED 1.5.2 stack, DRBD
8.3.10 from source
-Storage:
  Adaptec RAID controller 52445 (with BBU), 24x SAS
  RAID partitions
  drbd
  scst ib srpt_target (vdisk blockio)
-Replication-Link: IPoIB interface
-Cluster:
  Pacemaker 1.1.4
  Corosync 1.3.0
  2 communication links (1x crossover Gigabit Ethernet, 1x IPoIB link)
State on the secondary node : ( storage-node-b )
---------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by
root@storage-node-b.cluster.lokal, 2011-05-05 12:48:19
1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:1175532 dw:1175532 dr:0 al:0 bm:8386 lo:0 pe:0 ua:0 ap:0 ep:1
wo:n oos:0
10: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:480569816 dw:480569816 dr:0 al:0 bm:29452 lo:0 pe:0 ua:0 ap:0
ep:1 wo:n oos:0
13: cs:Unconfigured
State on the primary node : ( storage-node-a )
------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by
root@storage-node-a.san.lokal, 2011-04-27 11:33:30
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:516 nr:0 dw:2601253 dr:1083196 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1
wo:n oos:635196
10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:480359772 nr:0 dw:5831668 dr:483513740 al:0 bm:29323 lo:0 pe:0 ua:0
ap:0 ep:1 wo:n oos:210044
[===================>] sync'ed:100.0% (204/469100)M
finish: 3:41:03 speed: 12 (3,812) K/sec (stalled)
11: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
ns:0 nr:0 dw:38456751 dr:62544271 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1
wo:n oos:16370960
12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
ns:0 nr:0 dw:9660832 dr:7130755 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b
oos:2684524
13: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Inconsistent C r-----
ns:10052 nr:0 dw:14567958 dr:81131454 al:0 bm:3319 lo:0 pe:0 ua:0 ap:0
ep:1 wo:n oos:3465252
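To compare the two /proc/drbd snapshots above side by side, a small script along these lines may help. It is only a sketch: the function names and the two sample strings are made up for illustration (the samples are copied from the minor 10 and 13 entries above), and it only understands the 8.3-style layout shown in this post.

```python
import re

# Matches a device header such as
# "10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----"
# and also the short form "13: cs:Unconfigured".
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)(?:\s+ro:(\S+)\s+ds:(\S+))?")
OOS_RE = re.compile(r"\boos:(\d+)")


def parse_proc_drbd(text):
    """Return {minor: {"cs", "ro", "ds", "oos"}} from /proc/drbd-style text."""
    devices = {}
    current = None
    for line in text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = {
                "cs": header.group(2),
                "ro": header.group(3),   # None for unconfigured minors
                "ds": header.group(4),   # None for unconfigured minors
                "oos": None,
            }
            devices[int(header.group(1))] = current
            continue
        if current is not None:
            # The counters may be wrapped over several lines, so keep looking.
            oos = OOS_RE.search(line)
            if oos:
                current["oos"] = int(oos.group(1))
    return devices


def connection_mismatches(node_a, node_b):
    """Minors known to both nodes where exactly one side reports cs:Connected."""
    return sorted(
        minor
        for minor in node_a.keys() & node_b.keys()
        if (node_a[minor]["cs"] == "Connected") != (node_b[minor]["cs"] == "Connected")
    )


# Hypothetical sample input, copied from the snapshots in this post.
PRIMARY_SAMPLE = """\
10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:480359772 nr:0 dw:5831668 dr:483513740 al:0 bm:29323 lo:0 pe:0 ua:0
ap:0 ep:1 wo:n oos:210044
"""
SECONDARY_SAMPLE = """\
10: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:480569816 dw:480569816 dr:0 al:0 bm:29452 lo:0 pe:0 ua:0 ap:0
ep:1 wo:n oos:0
13: cs:Unconfigured
"""

if __name__ == "__main__":
    primary = parse_proc_drbd(PRIMARY_SAMPLE)
    secondary = parse_proc_drbd(SECONDARY_SAMPLE)
    for minor in connection_mismatches(primary, secondary):
        print(minor, primary[minor], secondary[minor])
```

For minor 10 this flags exactly the disagreement described here: the primary still reports SyncSource with out-of-sync data, while the secondary reports Connected with oos:0.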
I found an older thread advising to disconnect and reconnect the
affected resource, so I tried this:
root@storage-node-a:~# drbdadm disconnect r_mailspace
root@storage-node-a:~# drbdadm connect r_mailspace
root@storage-node-a:~# cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by
root@storage-node-a.san.lokal, 2011-04-27 11:33:30
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:516 nr:0 dw:2618682 dr:1095853 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1
wo:n oos:635364
10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:107416 nr:0 dw:5853702 dr:483621708 al:0 bm:29501 lo:0 pe:0 ua:0 ap:0
ep:1 wo:n oos:102656
[=========>..........] sync'ed: 52.0% (102656/210044)K
finish: 0:00:01 speed: 53,692 (53,692) K/sec
11: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
ns:0 nr:0 dw:38527089 dr:62599136 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1
wo:n oos:16389488
12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
ns:0 nr:0 dw:9669509 dr:7136228 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b
oos:2684528
13: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Inconsistent C r-----
ns:10052 nr:0 dw:14630185 dr:81146002 al:0 bm:3319 lo:0 pe:0 ua:0 ap:0
ep:1 wo:n oos:3466248
Logfiles from storage-node-a ( Primary ) for disconnecting/connecting
----------------------------------------------------------------------------
[407119.085662] block drbd10: peer( Secondary -> Unknown ) conn( SyncSource
-> Disconnecting )
[407119.107755] block drbd10: meta connection shut down by peer.
[407119.107761] block drbd10: asender terminated
[407119.107764] block drbd10: Terminating asender thread
[407119.459885] block drbd10: bitmap WRITE of 466 pages took 375 jiffies
[407119.598375] block drbd10: 205 MB (52511 bits) marked out-of-sync by on
disk bit-map.
[407119.598391] block drbd10: Connection closed
[407119.598400] block drbd10: conn( Disconnecting -> StandAlone )
[407119.598431] block drbd10: receiver terminated
[407119.598434] block drbd10: Terminating receiver thread
[407123.476627] block drbd10: conn( StandAlone -> Unconnected )
[407123.476667] block drbd10: Starting receiver thread (from drbd10_worker
[14471])
[407123.476708] block drbd10: receiver (re)started
[407123.476714] block drbd10: conn( Unconnected -> WFConnection )
[407123.577535] block drbd10: Handshake successful: Agreed network protocol
version 96
[407123.577546] block drbd10: conn( WFConnection -> WFReportParams )
[407123.577647] block drbd10: Starting asender thread (from drbd10_receiver
[24920])
[407123.577737] block drbd10: data-integrity-alg: <not-used>
[407123.577836] block drbd10: drbd_sync_handshake:
[407123.577841] block drbd10: self
0001000000000001:0002000000000000:0001000000000000:95D92A4B94991E58
bits:52511 flags:0
[407123.577845] block drbd10: peer
0001000000000000:0000000000000000:0002000000000000:0001000000000000 bits:0
flags:0
[407123.577849] block drbd10: was SyncSource, missed the resync finished
event, corrected myself:
[407123.577854] block drbd10: self
0001000000000001:0000000000000000:0002000000000000:0001000000000000
bits:52511 flags:0
[407123.577857] block drbd10: uuid_compare()=1 by rule 34
[407123.577864] block drbd10: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapS ) pdsk( Inconsistent -> Consistent )
[407124.012227] block drbd10: helper command: /sbin/drbdadm
before-resync-source minor-10
[407124.014695] block drbd10: helper command: /sbin/drbdadm
before-resync-source minor-10 exit code 0 (0x0)
[407124.014703] block drbd10: conn( WFBitMapS -> SyncSource ) pdsk(
Consistent -> Inconsistent )
[407124.014711] block drbd10: Began resync as SyncSource (will sync 210044
KB [52511 bits set]).
[407124.014762] block drbd10: updated sync UUID
0001000000000001:0001000000000000:0002000000000000:0001000000000000
Logfiles from storage-node-b ( Secondary ) for disconnecting/connecting
----------------------------------------------------------------------------
[408067.916086] block drbd10: peer( Primary -> Unknown ) conn( Connected ->
TearDown ) pdsk( UpToDate -> DUnknown )
[408067.938062] block drbd10: asender terminated
[408067.938067] block drbd10: Terminating asender thread
[408067.938290] block drbd10: Connection closed
[408067.938295] block drbd10: conn( TearDown -> Unconnected )
[408067.938301] block drbd10: receiver terminated
[408067.938303] block drbd10: Restarting receiver thread
[408067.938306] block drbd10: receiver (re)started
[408067.938311] block drbd10: conn( Unconnected -> WFConnection )
[408072.417315] block drbd10: Handshake successful: Agreed network protocol
version 96
[408072.417324] block drbd10: conn( WFConnection -> WFReportParams )
[408072.417542] block drbd10: Starting asender thread (from drbd10_receiver
[32234])
[408072.417720] block drbd10: data-integrity-alg: <not-used>
[408072.417769] block drbd10: drbd_sync_handshake:
[408072.417772] block drbd10: self
0001000000000000:0000000000000000:0002000000000000:0001000000000000 bits:0
flags:0
[408072.417776] block drbd10: peer
0001000000000001:0002000000000000:0001000000000000:95D92A4B94991E58
bits:52511 flags:2
[408072.417778] block drbd10: was SyncTarget, peer missed the resync
finished event, corrected peer:
[408072.417781] block drbd10: peer
0001000000000001:0000000000000000:0002000000000000:0001000000000000
bits:52511 flags:2
[408072.417784] block drbd10: uuid_compare()=-1 by rule 35
[408072.417789] block drbd10: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown ->
UpToDate )
[408072.851104] block drbd10: conn( WFBitMapT -> WFSyncUUID )
[408073.154475] block drbd10: updated sync uuid
0001000000000000:0000000000000000:0002000000000000:0001000000000000
[408073.177311] block drbd10: helper command: /sbin/drbdadm
before-resync-target minor-10
[408073.179989] block drbd10: helper command: /sbin/drbdadm
before-resync-target minor-10 exit code 0 (0x0)
[408073.179998] block drbd10: conn( WFSyncUUID -> SyncTarget ) disk(
Outdated -> Inconsistent )
[408073.180007] block drbd10: Began resync as SyncTarget (will sync 210044
KB [52511 bits set]).
[408077.968047] block drbd10: Resync done (total 4 sec; paused 0 sec; 52508
K/sec)
[408077.968054] block drbd10: updated UUIDs
0001000000000000:0000000000000000:0001000000000000:0002000000000000
[408077.968061] block drbd10: conn( SyncTarget -> Connected ) disk(
Inconsistent -> UpToDate )
[408078.344564] block drbd10: helper command: /sbin/drbdadm
after-resync-target minor-10
[408078.360396] block drbd10: helper command: /sbin/drbdadm
after-resync-target minor-10 exit code 1 (0x100)
[408078.371319] block drbd10: bitmap WRITE of 3865 pages took 11 jiffies
[408078.619232] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk
bit-map.
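The numbers in the two logs above fit together if each bitmap bit stands for 4 KiB of data: 52511 bits then correspond to the 210044 KB that both nodes announce for the resync, and to the "205 MB" marked out-of-sync on the primary. A quick check of that arithmetic, with the 4 KiB granularity written down as an explicit assumption and helper names chosen only for this example:

```python
# Assumption: one bit in the DRBD bitmap covers 4 KiB, which is what the
# logged pair "210044 KB [52511 bits set]" suggests.
BM_BLOCK_KIB = 4


def bits_to_kib(bits):
    """Convert a count of set bitmap bits to KiB of out-of-sync data."""
    return bits * BM_BLOCK_KIB


def kib_to_mib(kib):
    """Whole MiB, rounded down, as the kernel log appears to print it."""
    return kib // 1024


if __name__ == "__main__":
    kib = bits_to_kib(52511)
    print(kib, "KiB,", kib_to_mib(kib), "MiB")
```

The same conversion applies to the oos: counters in /proc/drbd, which are given in KB.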
DRBD State storage-node-b ( secondary ) after disconnect/connect
----------------------------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by
root@storage-node-b.cluster.lokal, 2011-05-05 12:48:19
1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:1175532 dw:1175532 dr:0 al:0 bm:8386 lo:0 pe:0 ua:0 ap:0 ep:1
wo:n oos:0
10: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:480569816 dw:480569816 dr:0 al:0 bm:29452 lo:0 pe:0 ua:0 ap:0
ep:1 wo:n oos:0
13: cs:Unconfigured
After 40 minutes the state on the primary has hardly changed. The sync speed
is slowing down, while the secondary reports that everything is OK.
At the moment I have disabled all other resources on the secondary (so the
WFConnection state is expected for all the others).
I have also stopped the cluster stack on the secondary node to avoid a
switchover to data that may not be up to date.
I have tested the IPoIB link with netperf; throughput is about 8 Gbit/s,
which looks good.
DRBD State storage-node-a ( primary ), about 40 minutes after disconnect/connect
----------------------------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by
root@storage-node-a.san.lokal, 2011-04-27 11:33:30
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:516 nr:0 dw:2676664 dr:1107571 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1
wo:n oos:636692
10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:210044 nr:0 dw:6011889 dr:483729658 al:0 bm:29688 lo:0 pe:0 ua:0 ap:0
ep:1 wo:n oos:111388
[========>...........] sync'ed: 48.1% (111388/210044)K
finish: 0:55:25 speed: 32 (32) K/sec
11: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
ns:0 nr:0 dw:39626755 dr:63686135 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1
wo:n oos:16514876
12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
ns:0 nr:0 dw:9755091 dr:7177902 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b
oos:2684572
13: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Inconsistent C r-----
ns:10052 nr:0 dw:14946032 dr:81243128 al:0 bm:3319 lo:0 pe:0 ua:0 ap:0
ep:1 wo:n oos:3489076
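One way to judge whether minor 10 is converging at all is to compare the oos: counter from the two primary snapshots: 102656 KB right after the reconnect and 111388 KB in the snapshot above. The following is a rough sketch only; the function name is invented for this example and the 40-minute interval is approximate.

```python
def oos_trend(oos_before_kib, oos_after_kib, elapsed_seconds):
    """Return (delta in KiB, KiB per second).

    Positive values mean the amount of out-of-sync data grew between the
    two readings; negative values mean the resync is catching up.
    """
    delta = oos_after_kib - oos_before_kib
    return delta, delta / elapsed_seconds


if __name__ == "__main__":
    # Readings for minor 10 on the primary, roughly 40 minutes apart.
    delta, rate = oos_trend(102656, 111388, 40 * 60)
    print(f"oos changed by {delta} KiB ({rate:.2f} KiB/s)")
```

With these two readings the delta is positive, so the out-of-sync counter on the primary grew rather than shrank over that period.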
DRBD Configuration
-----------------------
global {
    usage-count no;
}
common {
    protocol C;
    handlers {
        fence-peer "/usr/local/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/local/lib/drbd/crm-unfence-peer.sh";
        pri-on-incon-degr "echo b > /proc/sysrq-trigger";
        split-brain "/usr/local/lib/drbd/notify-split-brain.sh root";
    }
    disk {
        on-io-error detach;
        fencing resource-only;
        no-disk-barrier;
        no-disk-flushes;
        no-disk-drain;
    }
    syncer {
        rate 100M;
        al-extents 511;
        verify-alg sha1;
    }
    net {
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 0;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
}
resource r_mailspace {
    on storage-node-a {
        device /dev/drbd10;
        disk /dev/sdb1;
        address 10.212.13.1:7791;
        meta-disk internal;
    }
    on storage-node-b {
        device /dev/drbd10;
        disk /dev/sdb1;
        address 10.212.13.2:7791;
        meta-disk internal;
    }
}
Any hints or comments?
Kind regards,
Steve