Hi all,

I'm not able to re-sync my primary and secondary storage servers.
Both servers are identical. The setup looks like this:

-System: Slackware 13.1 64bit, kernel 2.6.33.12, OFED 1.5.2 stack, drbd 8.3.10 from source
-Storage: Adaptec raid controller 52445 (with BBU), 24x SAS raid Partitions drbd scst ib srpt_target (vdisk blockio)
-Replication-Link: IPoIB interface
-Cluster: Pacemaker 1.1.4, Corosync 1.3.0, 2 communication links (1x crossover Gigabit ethernet, 1x IPoIB link)

State on the secondary node (storage-node-b):
---------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-b.cluster.lokal, 2011-05-05 12:48:19
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1175532 dw:1175532 dr:0 al:0 bm:8386 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0
10: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:480569816 dw:480569816 dr:0 al:0 bm:29452 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0
13: cs:Unconfigured

State on the primary node (storage-node-a):
------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-a.san.lokal, 2011-04-27 11:33:30
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:516 nr:0 dw:2601253 dr:1083196 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:635196
10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:480359772 nr:0 dw:5831668 dr:483513740 al:0 bm:29323 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:210044
        [===================>] sync'ed:100.0% (204/469100)M
        finish: 3:41:03 speed: 12 (3,812) K/sec (stalled)
11: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:38456751 dr:62544271 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:16370960
12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:9660832 dr:7130755 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2684524
13: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Inconsistent C r-----
    ns:10052 nr:0 dw:14567958 dr:81131454 al:0 bm:3319 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:3465252

I found an older thread with the advice to disconnect/connect the
affected resource, so I tried this:

root@storage-node-a:~# drbdadm disconnect r_mailspace
root@storage-node-a:~# drbdadm connect r_mailspace
root@storage-node-a:~# cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-a.san.lokal, 2011-04-27 11:33:30
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:516 nr:0 dw:2618682 dr:1095853 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:635364
10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:107416 nr:0 dw:5853702 dr:483621708 al:0 bm:29501 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:102656
        [=========>..........] sync'ed: 52.0% (102656/210044)K
        finish: 0:00:01 speed: 53,692 (53,692) K/sec
11: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:38527089 dr:62599136 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:16389488
12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:9669509 dr:7136228 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2684528
13: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Inconsistent C r-----
    ns:10052 nr:0 dw:14630185 dr:81146002 al:0 bm:3319 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:3466248

Logfiles from storage-node-a (primary) for disconnecting/connecting
------------------------------------------------------------------------------------
[407119.085662] block drbd10: peer( Secondary -> Unknown ) conn( SyncSource -> Disconnecting )
[407119.107755] block drbd10: meta connection shut down by peer.
[407119.107761] block drbd10: asender terminated
[407119.107764] block drbd10: Terminating asender thread
[407119.459885] block drbd10: bitmap WRITE of 466 pages took 375 jiffies
[407119.598375] block drbd10: 205 MB (52511 bits) marked out-of-sync by on disk bit-map.
[407119.598391] block drbd10: Connection closed
[407119.598400] block drbd10: conn( Disconnecting -> StandAlone )
[407119.598431] block drbd10: receiver terminated
[407119.598434] block drbd10: Terminating receiver thread
[407123.476627] block drbd10: conn( StandAlone -> Unconnected )
[407123.476667] block drbd10: Starting receiver thread (from drbd10_worker [14471])
[407123.476708] block drbd10: receiver (re)started
[407123.476714] block drbd10: conn( Unconnected -> WFConnection )
[407123.577535] block drbd10: Handshake successful: Agreed network protocol version 96
[407123.577546] block drbd10: conn( WFConnection -> WFReportParams )
[407123.577647] block drbd10: Starting asender thread (from drbd10_receiver [24920])
[407123.577737] block drbd10: data-integrity-alg: <not-used>
[407123.577836] block drbd10: drbd_sync_handshake:
[407123.577841] block drbd10: self 0001000000000001:0002000000000000:0001000000000000:95D92A4B94991E58 bits:52511 flags:0
[407123.577845] block drbd10: peer 0001000000000000:0000000000000000:0002000000000000:0001000000000000 bits:0 flags:0
[407123.577849] block drbd10: was SyncSource, missed the resync finished event, corrected myself:
[407123.577854] block drbd10: self 0001000000000001:0000000000000000:0002000000000000:0001000000000000 bits:52511 flags:0
[407123.577857] block drbd10: uuid_compare()=1 by rule 34
[407123.577864] block drbd10: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Inconsistent -> Consistent )
[407124.012227] block drbd10: helper command: /sbin/drbdadm before-resync-source minor-10
[407124.014695] block drbd10: helper command: /sbin/drbdadm before-resync-source minor-10 exit code 0 (0x0)
[407124.014703] block drbd10: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
[407124.014711] block drbd10: Began resync as SyncSource (will sync 210044 KB [52511 bits set]).
[407124.014762] block drbd10: updated sync UUID 0001000000000001:0001000000000000:0002000000000000:0001000000000000

Logfiles from storage-node-b (secondary) for disconnecting/connecting
------------------------------------------------------------------------------------
[408067.916086] block drbd10: peer( Primary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
[408067.938062] block drbd10: asender terminated
[408067.938067] block drbd10: Terminating asender thread
[408067.938290] block drbd10: Connection closed
[408067.938295] block drbd10: conn( TearDown -> Unconnected )
[408067.938301] block drbd10: receiver terminated
[408067.938303] block drbd10: Restarting receiver thread
[408067.938306] block drbd10: receiver (re)started
[408067.938311] block drbd10: conn( Unconnected -> WFConnection )
[408072.417315] block drbd10: Handshake successful: Agreed network protocol version 96
[408072.417324] block drbd10: conn( WFConnection -> WFReportParams )
[408072.417542] block drbd10: Starting asender thread (from drbd10_receiver [32234])
[408072.417720] block drbd10: data-integrity-alg: <not-used>
[408072.417769] block drbd10: drbd_sync_handshake:
[408072.417772] block drbd10: self 0001000000000000:0000000000000000:0002000000000000:0001000000000000 bits:0 flags:0
[408072.417776] block drbd10: peer 0001000000000001:0002000000000000:0001000000000000:95D92A4B94991E58 bits:52511 flags:2
[408072.417778] block drbd10: was SyncTarget, peer missed the resync finished event, corrected peer:
[408072.417781] block drbd10: peer 0001000000000001:0000000000000000:0002000000000000:0001000000000000 bits:52511 flags:2
[408072.417784] block drbd10: uuid_compare()=-1 by rule 35
[408072.417789] block drbd10: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
[408072.851104] block drbd10: conn( WFBitMapT -> WFSyncUUID )
[408073.154475] block drbd10: updated sync uuid 0001000000000000:0000000000000000:0002000000000000:0001000000000000
[408073.177311] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10
[408073.179989] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10 exit code 0 (0x0)
[408073.179998] block drbd10: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
[408073.180007] block drbd10: Began resync as SyncTarget (will sync 210044 KB [52511 bits set]).
[408077.968047] block drbd10: Resync done (total 4 sec; paused 0 sec; 52508 K/sec)
[408077.968054] block drbd10: updated UUIDs 0001000000000000:0000000000000000:0001000000000000:0002000000000000
[408077.968061] block drbd10: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
[408078.344564] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10
[408078.360396] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10 exit code 1 (0x100)
[408078.371319] block drbd10: bitmap WRITE of 3865 pages took 11 jiffies
[408078.619232] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.

DRBD state on storage-node-b (secondary) after disconnect/connect
------------------------------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-b.cluster.lokal, 2011-05-05 12:48:19
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1175532 dw:1175532 dr:0 al:0 bm:8386 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0
10: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:480569816 dw:480569816 dr:0 al:0 bm:29452 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:0
13: cs:Unconfigured

After 40 minutes the state on the primary hasn't changed much.
The sync speed is slowing down, while the secondary says everything is OK.

At the moment I have disabled all other resources on the secondary
(so the WFConnection state is OK for all the others).
I have also stopped the cluster stack on the secondary node to avoid a
switchover to (possibly) out-of-date data.

I have tested the IPoIB link with netperf; throughput is about 8 Gbit/s,
which looks good.

DRBD state on storage-node-a (primary), about 40 minutes after disconnect/connect
-------------------------------------------------------------------------------------------------
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by root@storage-node-a.san.lokal, 2011-04-27 11:33:30
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:516 nr:0 dw:2676664 dr:1107571 al:0 bm:23 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:636692
10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:210044 nr:0 dw:6011889 dr:483729658 al:0 bm:29688 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:111388
        [========>...........] sync'ed: 48.1% (111388/210044)K
        finish: 0:55:25 speed: 32 (32) K/sec
11: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:39626755 dr:63686135 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:16514876
12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:9755091 dr:7177902 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2684572
13: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Inconsistent C r-----
    ns:10052 nr:0 dw:14946032 dr:81243128 al:0 bm:3319 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:3489076

DRBD Configuration
-----------------------
global {
    usage-count no;
}

common {
    protocol C;

    handlers {
        fence-peer "/usr/local/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/local/lib/drbd/crm-unfence-peer.sh";
        pri-on-incon-degr "echo b > /proc/sysrq-trigger";
        split-brain "/usr/local/lib/drbd/notify-split-brain.sh root";
    }

    disk {
        on-io-error detach;
        fencing resource-only;
        no-disk-barrier;
        no-disk-flushes;
        no-disk-drain;
    }

    syncer {
        rate 100M;
        al-extents 511;
        verify-alg sha1;
    }

    net {
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 0;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
}

resource r_mailspace {
    on storage-node-a {
        device /dev/drbd10;
        disk /dev/sdb1;
        address 10.212.13.1:7791;
        meta-disk internal;
    }
    on storage-node-b {
        device /dev/drbd10;
        disk /dev/sdb1;
        address 10.212.13.2:7791;
        meta-disk internal;
    }
}

Any hints or comments?

kind regards
Steve
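
For anyone who wants to compare the two nodes' /proc/drbd output without reading it by eye, here is a minimal sketch in Python. It is not part of DRBD or of the setup above; the function names parse_proc_drbd and connected_but_out_of_sync are made up for illustration, and it assumes the 8.3-style /proc/drbd layout shown in this mail, where oos is the out-of-sync amount in KiB. It only lists minors that report Connected and UpToDate/UpToDate but still carry a non-zero oos counter, as minor 1 does on the primary here.

```python
import re

# First line of a device entry, e.g.
# "10: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----"
# Unconfigured minors ("13: cs:Unconfigured") have no ro:/ds: fields.
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)(?:\s+ro:(\S+)\s+ds:(\S+))?")
# Out-of-sync counter on the statistics line, e.g. "... wo:n oos:210044"
OOS_RE = re.compile(r"\boos:(\d+)")


def parse_proc_drbd(text):
    """Return {minor: {"cs", "ro", "ds", "oos_kib"}} parsed from /proc/drbd text."""
    devices = {}
    current = None
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = int(match.group(1))
            devices[current] = {
                "cs": match.group(2),
                "ro": match.group(3),
                "ds": match.group(4),
                "oos_kib": None,
            }
        if current is not None and devices[current]["oos_kib"] is None:
            oos = OOS_RE.search(line)
            if oos:
                devices[current]["oos_kib"] = int(oos.group(1))
    return devices


def connected_but_out_of_sync(devices):
    """Minors reporting Connected + UpToDate/UpToDate while oos is still > 0."""
    return sorted(
        minor
        for minor, dev in devices.items()
        if dev["cs"] == "Connected"
        and dev["ds"] == "UpToDate/UpToDate"
        and (dev["oos_kib"] or 0) > 0
    )


if __name__ == "__main__":
    # Reads the local node's status; run it on each node and compare.
    with open("/proc/drbd") as handle:
        status = parse_proc_drbd(handle.read())
    for minor in sorted(status):
        print(minor, status[minor])
    print("connected but out of sync:", connected_but_out_of_sync(status))
```

Run on both nodes, this makes it easier to spot the mismatch described above, where the primary still counts out-of-sync blocks and the secondary reports oos:0.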