Hi,

I configured a two-node cluster and am able to mirror the nodes, but I am facing an issue when testing some scenarios. The steps I followed that end up in standalone mode are:

1. Shut down the primary node (N1).
2. Make the other node (N2) primary and make some changes in the file system.
3. Shut down N2.
4. Start N1, make it primary, make some changes in the file system, then reboot N1 and start N2.
5. When they come up, the state of N1 is
   0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
6. The state of N2 is
   0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----

However, I have added the automatic split-brain recovery policies to my config file. Please see the config file at the end. Please let me know how to resolve this standalone situation automatically.

Logs from N2:

kernel: drbd: initialized. Version: 8.3.12 (api:88/proto:86-96)
Feb 23 16:03:19 lab1602 kernel: drbd: GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag at Build32R6, 2011-11-20 10:55:07
Feb 23 16:03:19 lab1602 kernel: drbd: registered as block device major 147
Feb 23 16:03:19 lab1602 kernel: drbd: minor_table @ 0xf40ac700
Feb 23 16:03:19 lab1602 kernel: block drbd0: Starting worker thread (from cqueue [1538])
Feb 23 16:03:19 lab1602 kernel: block drbd0: disk( Diskless -> Attaching )
Feb 23 16:03:19 lab1602 kernel: block drbd0: Found 4 transactions (6 active extents) in activity log.
Feb 23 16:03:19 lab1602 kernel: block drbd0: Method to ensure write ordering: drain
Feb 23 16:03:19 lab1602 kernel: block drbd0: max BIO size = 131072
Feb 23 16:03:19 lab1602 kernel: block drbd0: drbd_bm_resize called with capacity == 2097152
Feb 23 16:03:19 lab1602 kernel: block drbd0: resync bitmap: bits=262144 words=8192 pages=8
Feb 23 16:03:19 lab1602 kernel: block drbd0: size = 1024 MB (1048576 KB)
Feb 23 16:03:19 lab1602 kernel: block drbd0: bitmap READ of 8 pages took 1 jiffies
Feb 23 16:03:19 lab1602 kernel: block drbd0: recounting of set bits took additional 0 jiffies
Feb 23 16:03:19 lab1602 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Feb 23 16:03:19 lab1602 kernel: block drbd0: Marked additional 12 MB as out-of-sync based on AL.
Feb 23 16:03:19 lab1602 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Feb 23 16:03:19 lab1602 kernel: block drbd0: 12 MB (3072 bits) marked out-of-sync by on disk bit-map.
Feb 23 16:03:19 lab1602 kernel: block drbd0: disk( Attaching -> UpToDate )
Feb 23 16:03:19 lab1602 kernel: block drbd0: attached to UUIDs 1CAAF7BF141F6FAB:77403D46BDEFCA05:022A37BAC8B2DC72:022937BAC8B2DC72
Feb 23 16:03:19 lab1602 kernel: block drbd0: conn( StandAlone -> Unconnected )
Feb 23 16:03:19 lab1602 kernel: block drbd0: Starting receiver thread (from drbd0_worker [1547])
Feb 23 16:03:19 lab1602 kernel: block drbd0: receiver (re)started
Feb 23 16:03:19 lab1602 kernel: block drbd0: conn( Unconnected -> WFConnection )
Feb 23 16:03:37 lab1602 kernel: block drbd0: Handshake successful: Agreed network protocol version 96
Feb 23 16:03:37 lab1602 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Feb 23 16:03:37 lab1602 kernel: block drbd0: conn( WFConnection -> WFReportParams )
Feb 23 16:03:37 lab1602 kernel: block drbd0: Starting asender thread (from drbd0_receiver [1565])
Feb 23 16:03:37 lab1602 kernel: block drbd0: data-integrity-alg: <not-used>
Feb 23 16:03:37 lab1602 kernel: block drbd0: drbd_sync_handshake:
Feb 23 16:03:37 lab1602 kernel: block drbd0: self 1CAAF7BF141F6FAB:77403D46BDEFCA05:022A37BAC8B2DC72:022937BAC8B2DC72 bits:3072 flags:0
Feb 23 16:03:37 lab1602 kernel: block drbd0: peer 1C7B811C8FAD3496:77403D46BDEFCA04:022A37BAC8B2DC72:022937BAC8B2DC72 bits:25 flags:0
Feb 23 16:03:37 lab1602 kernel: block drbd0: uuid_compare()=100 by rule 90
Feb 23 16:03:37 lab1602 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
Feb 23 16:03:37 lab1602 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
Feb 23 16:03:37 lab1602 kernel: block drbd0: State change failed: Device is held open by someone
Feb 23 16:03:37 lab1602 kernel: block drbd0: state = { cs:WFReportParams ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
Feb 23 16:03:37 lab1602 kernel: block drbd0: wanted = { cs:WFReportParams ro:Secondary/Unknown ds:UpToDate/DUnknown r----- }
Feb 23 16:03:37 lab1602 kernel: block drbd0: State change failed: Device is held open by someone
Feb 23 16:03:37 lab1602 kernel: block drbd0: state = { cs:WFReportParams ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
Feb 23 16:03:37 lab1602 kernel: block drbd0: wanted = { cs:WFReportParams ro:Secondary/Unknown ds:UpToDate/DUnknown r----- }
Feb 23 16:03:37 lab1602 kernel: block drbd0: helper command: /sbin/drbdadm pri-lost-after-sb minor-0
Feb 23 16:03:37 lab1602 kernel: block drbd0: helper command: /sbin/drbdadm pri-lost-after-sb minor-0 exit code 0 (0x0)
Feb 23 16:03:37 lab1602 kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
Feb 23 16:03:37 lab1602 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Feb 23 16:03:37 lab1602 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Feb 23 16:03:37 lab1602 kernel: block drbd0: conn( WFReportParams -> Disconnecting )
Feb 23 16:03:37 lab1602 kernel: block drbd0: error receiving ReportState, l: 4!
Feb 23 16:03:37 lab1602 kernel: block drbd0: asender terminated
Feb 23 16:03:37 lab1602 kernel: block drbd0: Terminating asender thread
Feb 23 16:03:37 lab1602 kernel: block drbd0: Connection closed
Feb 23 16:03:37 lab1602 kernel: block drbd0: conn( Disconnecting -> StandAlone )
Feb 23 16:03:37 lab1602 kernel: block drbd0: receiver terminated
Feb 23 16:03:37 lab1602 kernel: block drbd0: Terminating receiver thread

Logs from N1:

Feb 23 16:03:54 lab1601 kernel: drbd: initialized. Version: 8.3.12 (api:88/proto:86-96)
Feb 23 16:03:54 lab1601 kernel: drbd: GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag at Build32R6, 2011-11-20 10:55:07
Feb 23 16:03:54 lab1601 kernel: drbd: registered as block device major 147
Feb 23 16:03:54 lab1601 kernel: drbd: minor_table @ 0xf3d3e300
Feb 23 16:03:54 lab1601 kernel: block drbd0: Starting worker thread (from cqueue [1537])
Feb 23 16:03:54 lab1601 kernel: block drbd0: disk( Diskless -> Attaching )
Feb 23 16:03:54 lab1601 kernel: block drbd0: Found 4 transactions (6 active extents) in activity log.
Feb 23 16:03:54 lab1601 kernel: block drbd0: Method to ensure write ordering: drain
Feb 23 16:03:54 lab1601 kernel: block drbd0: max BIO size = 131072
Feb 23 16:03:54 lab1601 kernel: block drbd0: drbd_bm_resize called with capacity == 2097152
Feb 23 16:03:54 lab1601 kernel: block drbd0: resync bitmap: bits=262144 words=8192 pages=8
Feb 23 16:03:54 lab1601 kernel: block drbd0: size = 1024 MB (1048576 KB)
Feb 23 16:03:54 lab1601 kernel: block drbd0: bitmap READ of 8 pages took 2 jiffies
Feb 23 16:03:54 lab1601 kernel: block drbd0: recounting of set bits took additional 0 jiffies
Feb 23 16:03:54 lab1601 kernel: block drbd0: 100 KB (25 bits) marked out-of-sync by on disk bit-map.
Feb 23 16:03:54 lab1601 kernel: block drbd0: disk( Attaching -> UpToDate )
Feb 23 16:03:54 lab1601 kernel: block drbd0: attached to UUIDs 1C7B811C8FAD3497:77403D46BDEFCA04:022A37BAC8B2DC72:022937BAC8B2DC72
Feb 23 16:03:54 lab1601 kernel: block drbd0: conn( StandAlone -> Unconnected )
Feb 23 16:03:54 lab1601 kernel: block drbd0: Starting receiver thread (from drbd0_worker [1547])
Feb 23 16:03:54 lab1601 kernel: block drbd0: receiver (re)started
Feb 23 16:03:54 lab1601 kernel: block drbd0: conn( Unconnected -> WFConnection )
Feb 23 16:03:55 lab1601 kernel: block drbd0: Handshake successful: Agreed network protocol version 96
Feb 23 16:03:55 lab1601 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Feb 23 16:03:55 lab1601 kernel: block drbd0: conn( WFConnection -> WFReportParams )
Feb 23 16:03:55 lab1601 kernel: block drbd0: Starting asender thread (from drbd0_receiver [1564])
Feb 23 16:03:55 lab1601 kernel: block drbd0: data-integrity-alg: <not-used>
Feb 23 16:03:55 lab1601 kernel: block drbd0: drbd_sync_handshake:
Feb 23 16:03:55 lab1601 kernel: block drbd0: self 1C7B811C8FAD3496:77403D46BDEFCA04:022A37BAC8B2DC72:022937BAC8B2DC72 bits:25 flags:0
Feb 23 16:03:55 lab1601 kernel: block drbd0: peer 1CAAF7BF141F6FAB:77403D46BDEFCA05:022A37BAC8B2DC72:022937BAC8B2DC72 bits:3072 flags:2
Feb 23 16:03:55 lab1601 kernel: block drbd0: uuid_compare()=100 by rule 90
Feb 23 16:03:55 lab1601 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
Feb 23 16:03:55 lab1601 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
Feb 23 16:03:55 lab1601 kernel: block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node
Feb 23 16:03:55 lab1601 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Feb 23 16:03:55 lab1601 kernel: block drbd0: sock was shut down by peer
Feb 23 16:03:55 lab1601 kernel: block drbd0: peer( Primary -> Unknown ) conn( WFBitMapS -> BrokenPipe ) pdsk( Consistent -> DUnknown )
Feb 23 16:03:55 lab1601 kernel: block drbd0: short read expecting header on sock: r=0
Feb 23 16:03:55 lab1601 kernel: block drbd0: meta connection shut down by peer.
Feb 23 16:03:55 lab1601 kernel: block drbd0: asender terminated
Feb 23 16:03:55 lab1601 kernel: block drbd0: Terminating asender thread
Feb 23 16:03:55 lab1601 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Feb 23 16:03:55 lab1601 kernel: block drbd0: 100 KB (25 bits) marked out-of-sync by on disk bit-map.
Feb 23 16:03:55 lab1601 kernel: block drbd0: Connection closed
Feb 23 16:03:55 lab1601 kernel: block drbd0: conn( BrokenPipe -> Unconnected )
Feb 23 16:03:55 lab1601 kernel: block drbd0: receiver terminated
Feb 23 16:03:55 lab1601 kernel: block drbd0: Restarting receiver thread
Feb 23 16:03:55 lab1601 kernel: block drbd0: receiver (re)started
Feb 23 16:03:55 lab1601 kernel: block drbd0: conn( Unconnected -> WFConnection )

--

global {
    usage-count no;
}

common {
    protocol C;

    startup {
        degr-wfc-timeout 3; #3 sec..
        wfc-timeout 120; # 3 min.
        # become-primary-on both;
    } # end of startup

    handlers {
    } # end of handlers

    disk {
        on-io-error detach;
        no-disk-barrier;
        no-disk-flushes;
        no-md-flushes;
    } # end of disk

    net {
        timeout 60; # 6 seconds (unit = 0.1 seconds)
        connect-int 10; # 10 seconds (unit = 1 second)
        ping-int 10; # 10 seconds (unit = 1 second)
        ping-timeout 5; # 500 ms (unit = 0.1 seconds)
        cram-hmac-alg sha1;
        shared-secret "DRBD disk mirroring for BTP-R";
        after-sb-0pri discard-older-primary;
        #after-sb-1pri discard-secondary;
        #after-sb-2pri violently-as0p;
        after-sb-1pri call-pri-lost-after-sb;
        after-sb-2pri call-pri-lost-after-sb;
        rr-conflict call-pri-lost;
    } # end of net
} # end of commo

resource mfs_drbd {
    syncer {
        rate 100M;
    }

    on lab1602 {
        device /dev/drbd0;
        disk /dev/vg0/mirror;
        address 10.203.230.136:7788;
        meta-disk /dev/vg0/drbdmeta[0];
        #meta-disk internal;
    }

    on lab1601 {
        device /dev/drbd0;
        disk /dev/vg0/mirror;
        address 10.203.230.135:7788;
        meta-disk /dev/vg0/drbdmeta[0];
        #meta-disk internal;
    }
} #end of resource sfs_drbd

Thanks in advance.
vengatesh