Dear DRBD,

I am continuing to try and work out the verify on 8.2.5 from the manual:

http://www.drbd.org/users-guide/s-use-online-verify.html

The verify works and finds bad blocks. However, I can't seem to clear them using the suggested method of disconnect and connect. Further verify commands seem to add bad blocks (on my set-up), which I could only clear with a full invalidation.

I also did a check on all three suggested hashing algorithms, and found they all showed the same load and time to complete, which was also exactly the same load and time as a full invalidate. So I am not sure what the advantage of the verify is. The current version, 8.2.5, seems to promote the creation of bad blocks, does not allow them to clear, and has the same time and loading as a full resync.

I'm hoping you can tell me that I am doing something completely wrong. I've included my setup details at the end of the email.

--------------

For instance, I did a verify:

_Secondary Log_
08:42:52 drbd0: conn( Connected -> VerifyT )
09:35:59 drbd0: Out of sync: start=628149776, size=8 (sectors)
10:59:59 drbd0: Online verify done (total 8226 sec; paused 0 sec; 98024 K/sec)
10:59:59 drbd0: Online verify found 1 4k block out of sync!
10:59:59 drbd0: helper command: /sbin/drbdadm out-of-sync
10:59:59 drbd0: Writing the whole bitmap.
10:59:59 drbd0: writing of bitmap took 13 jiffies
10:59:59 drbd0: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
10:59:59 drbd0: conn( VerifyT -> Connected )

_Primary Log_
08:42:52 drbd0: conn( Connected -> VerifyS )
09:35:59 drbd0: Out of sync: start=628149776, size=8 (sectors)
10:59:59 drbd0: Online verify done (total 8226 sec; paused 0 sec; 98024 K/sec)
10:59:59 drbd0: Online verify found 1 4k block out of sync!
10:59:59 drbd0: helper command: /sbin/drbdadm out-of-sync
10:59:59 drbd0: Writing the whole bitmap.
10:59:59 drbd0: writing of bitmap took 21 jiffies
10:59:59 drbd0: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
10:59:59 drbd0: conn( VerifyS -> Connected )

I can see it completed in 2h 17m, which matches my syncer rate (100M) in /etc/drbd.conf. This found a bad block. I cleared this using the suggested method and did another verify:

Primary:
# drbdadm disconnect all
# drbdadm connect all

_Secondary Log_
11:01:58 drbd0: peer( Primary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
11:01:58 drbd0: Writing meta data super block now.
11:01:58 drbd0: meta connection shut down by peer.
11:01:58 drbd0: asender terminated
11:01:58 drbd0: Terminating asender thread
11:01:58 drbd0: tl_clear()
11:01:58 drbd0: Connection closed
11:01:58 drbd0: conn( TearDown -> Unconnected )
11:01:58 drbd0: receiver terminated
11:01:58 drbd0: receiver (re)started
11:01:58 drbd0: conn( Unconnected -> WFConnection )
11:02:09 drbd0: Handshake successful: Agreed network protocol version 88
11:02:09 drbd0: conn( WFConnection -> WFReportParams )
11:02:09 drbd0: Starting asender thread (from drbd0_receiver [25884])
11:02:09 drbd0: data-integrity-alg: <not-used>
11:02:09 drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
11:02:09 drbd0: Writing meta data super block now.
11:02:10 drbd0: conn( WFBitMapT -> WFSyncUUID )
11:02:10 drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
11:02:10 drbd0: Began resync as SyncTarget (will sync 4848 KB [1212 bits set]).
11:02:10 drbd0: Writing meta data super block now.
11:02:10 drbd0: Resync done (total 1 sec; paused 0 sec; 4848 K/sec)
11:02:10 drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
11:02:10 drbd0: Writing meta data super block now.

_Primary Log_
11:01:58 drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
11:01:58 drbd0: short read expecting header on sock: r=-512
11:01:58 drbd0: asender terminated
11:01:58 drbd0: Terminating asender thread
11:01:58 drbd0: Creating new current UUID
11:01:58 drbd0: Writing meta data super block now.
11:01:58 drbd0: tl_clear()
11:01:58 drbd0: Connection closed
11:01:58 drbd0: conn( Disconnecting -> StandAlone )
11:01:58 drbd0: receiver terminated
11:01:58 drbd0: Terminating receiver thread
11:02:09 drbd0: conn( StandAlone -> Unconnected )
11:02:09 drbd0: Starting receiver thread (from drbd0_worker [11685])
11:02:09 drbd0: receiver (re)started
11:02:09 drbd0: conn( Unconnected -> WFConnection )
11:02:09 drbd0: Handshake successful: Agreed network protocol version 88
11:02:09 drbd0: conn( WFConnection -> WFReportParams )
11:02:09 drbd0: Starting asender thread (from drbd0_receiver [31085])
11:02:09 drbd0: data-integrity-alg: <not-used>
11:02:09 drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
11:02:09 drbd0: Writing meta data super block now.
11:02:10 drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
11:02:10 drbd0: Began resync as SyncSource (will sync 4848 KB [1212 bits set]).
11:02:10 drbd0: Writing meta data super block now.
11:02:10 drbd0: Resync done (total 1 sec; paused 0 sec; 4848 K/sec)
11:02:10 drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
11:02:10 drbd0: Writing meta data super block now.

Then verified again:

# drbdadm verify all

_Secondary Log_
11:02:19 drbd0: conn( Connected -> VerifyT )
11:55:29 drbd0: Out of sync: start=628149776, size=8 (sectors)
12:09:22 drbd0: Out of sync: start=791838200, size=8 (sectors)
13:19:31 drbd0: Online verify done (total 8231 sec; paused 0 sec; 97964 K/sec)
13:19:31 drbd0: Online verify found 2 4k block out of sync!
13:19:31 drbd0: helper command: /sbin/drbdadm out-of-sync
13:19:31 drbd0: Writing the whole bitmap.
13:19:31 drbd0: writing of bitmap took 13 jiffies
13:19:31 drbd0: 8 KB (2 bits) marked out-of-sync by on disk bit-map.
13:19:31 drbd0: conn( VerifyT -> Connected )

_Primary Log_
11:02:19 drbd0: conn( Connected -> VerifyS )
11:55:29 drbd0: Out of sync: start=628149776, size=8 (sectors)
12:09:22 drbd0: Out of sync: start=791838200, size=8 (sectors)
13:19:31 drbd0: Online verify done (total 8231 sec; paused 0 sec; 97964 K/sec)
13:19:31 drbd0: Online verify found 2 4k block out of sync!
13:19:31 drbd0: helper command: /sbin/drbdadm out-of-sync
13:19:31 drbd0: Writing the whole bitmap.
13:19:31 drbd0: writing of bitmap took 20 jiffies
13:19:31 drbd0: 8 KB (2 bits) marked out-of-sync by on disk bit-map.
13:19:31 drbd0: conn( VerifyS -> Connected )

I now have two bad blocks where previously I had one. This was cleared by an invalidate:

Secondary:
# drbdadm invalidate all

After completion, I ran a verification from my primary:

# drbdadm verify all

_Secondary Log_
07:58:10 drbd0: conn( Connected -> VerifyT )
10:15:17 drbd0: Online verify done (total 8226 sec; paused 0 sec; 98024 K/sec)
10:15:17 drbd0: conn( VerifyT -> Connected )

_Primary Log_
07:58:10 drbd0: conn( Connected -> VerifyS )
10:15:17 drbd0: Online verify done (total 8226 sec; paused 0 sec; 98024 K/sec)
10:15:17 drbd0: conn( VerifyS -> Connected )

However, a few more verifications later, using different hashing algorithms, I am back where I was:

_Secondary Log_
14:28:59 drbd0: conn( Connected -> VerifyT )
15:22:06 drbd0: Out of sync: start=628149776, size=8 (sectors)
15:35:47 drbd0: Out of sync: start=790351000, size=8 (sectors)
16:46:08 drbd0: Online verify done (total 8228 sec; paused 0 sec; 98000 K/sec)
16:46:08 drbd0: Online verify found 2 4k block out of sync!
16:46:08 drbd0: helper command: /sbin/drbdadm out-of-sync
16:46:08 drbd0: Writing the whole bitmap.
16:46:08 drbd0: drbd_bm_total_weight: (!b) in /usr/src/drbd-8.2.5/drbd/drbd_bitmap.c:429
16:46:08 drbd0: writing of bitmap took 13 jiffies
16:46:08 drbd0: 8 KB (2 bits) marked out-of-sync by on disk bit-map.
16:46:08 drbd0: conn( VerifyT -> Connected )

_Primary Log_
14:28:59 drbd0: conn( Connected -> VerifyS )
15:22:06 drbd0: Out of sync: start=628149776, size=8 (sectors)
15:35:47 drbd0: Out of sync: start=790351000, size=8 (sectors)
16:46:08 drbd0: Online verify done (total 8228 sec; paused 0 sec; 98000 K/sec)
16:46:08 drbd0: Online verify found 2 4k block out of sync!
16:46:08 drbd0: helper command: /sbin/drbdadm out-of-sync
16:46:08 drbd0: Writing the whole bitmap.
16:46:08 drbd0: drbd_bm_total_weight: (!b) in /usr/src/drbd-8.2.5/drbd/drbd_bitmap.c:429
16:46:08 drbd0: writing of bitmap took 20 jiffies
16:46:08 drbd0: 8 KB (2 bits) marked out-of-sync by on disk bit-map.
16:46:08 drbd0: conn( VerifyS -> Connected )

Not only have I got my bad blocks back, I have also acquired a 'drbd_bm_total_weight' warning...

I am not sure this is expected behaviour, but I am also not sure that it's not something I have done. If you have any advice, please let me know.

Also, this is my comparison of the various algorithms:

Check of 787,456 MB over 1000baseT, limited to a sync rate of 100 MB/sec.
Raw disk benchmark by Bonnie++ at ~ 317 MB/sec (input per block).

+------------+--------------+-------------+-----------+-----------+
| Algorithm  | Time (hh:mm) | Rate MB/sec | Pri. Load | Sec. Load |
+------------+--------------+-------------+-----------+-----------+
| md5        | 02:17        |        95.8 |      2.46 |      1.37 |
| sha1       | 02:17        |        95.8 |      2.25 |      1.29 |
| crc32c     | 02:17        |        95.8 |      2.37 |      1.17 |
| INVALIDATE | 02:16        |        96.5 |      2.64 |      1.24 |
+------------+--------------+-------------+-----------+-----------+

Regards,
Ben

common {
  net {
    max-buffers 40000;
    unplug-watermark 40000;
    max-epoch-size 16384;
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  syncer {
    rate 100M;
    al-extents 257;
    verify-alg md5;
  }
  startup {
    degr-wfc-timeout 120;
  }
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    pri-lost "echo dbms-04 pri-lost. Have a look at the log files. | mail -s 'DRBD Alert dbms-04' sysalerts at ...";
    split-brain "echo dbms-04 split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' sysalerts at ...";
  }
}

resource dbms-04 {
  protocol B;
  on {
    device /dev/drbd0;
    disk /dev/cciss/c0d0p4;
    address 192.168.95.18:7789;
    meta-disk /dev/cciss/c0d0p3 [0];
  }
  on {
    device /dev/drbd0;
    disk /dev/cciss/c0d0p4;
    address 192.168.95.17:7788;
    meta-disk /dev/cciss/c0d0p3 [0];
  }
  disk {
    on-io-error detach;
    size 769G;
  }
}

version: 8.2.5 (api:88/proto:86-88)
GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by root at hp-tm-06, 2008-03-03 11:09:15
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
    ns:0 nr:6404820 dw:6404820 dr:806354944 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:201539520 misses:49216 starving:0 dirty:0 changed:49216
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
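
To see which sectors recur between verify runs, the "Out of sync" lines from the logs above can be compared mechanically. The following is only a minimal sketch in Python: the function and variable names are illustrative, the regular expression assumes the log line format exactly as quoted in this mail, and the 512-byte sector size is an assumption (it is consistent with "size=8 (sectors)" being reported as one 4k block). The clean verify straight after the invalidate is left out, since it reported no differences.

```python
import re

# Matches kernel log lines such as:
#   09:35:59 drbd0: Out of sync: start=628149776, size=8 (sectors)
OOS_RE = re.compile(r"Out of sync: start=(\d+), size=(\d+) \(sectors\)")

# Assumption: 512-byte sectors, so 8 sectors correspond to one 4k block.
SECTOR_BYTES = 512


def out_of_sync_extents(log_text):
    """Return the set of (start_sector, size_in_sectors) pairs found in a log excerpt."""
    return {(int(start), int(size)) for start, size in OOS_RE.findall(log_text)}


# "Out of sync" lines copied from the three verify runs that reported differences.
runs = {
    "first verify": (
        "09:35:59 drbd0: Out of sync: start=628149776, size=8 (sectors)\n"
    ),
    "after disconnect/connect": (
        "11:55:29 drbd0: Out of sync: start=628149776, size=8 (sectors)\n"
        "12:09:22 drbd0: Out of sync: start=791838200, size=8 (sectors)\n"
    ),
    "after invalidate and further verifies": (
        "15:22:06 drbd0: Out of sync: start=628149776, size=8 (sectors)\n"
        "15:35:47 drbd0: Out of sync: start=790351000, size=8 (sectors)\n"
    ),
}

extents = {name: out_of_sync_extents(text) for name, text in runs.items()}

# Extents that were reported in every one of the runs above.
persistent = set.intersection(*extents.values())

for start, size in sorted(persistent):
    print(f"sector {start}: {size} sectors, byte offset {start * SECTOR_BYTES}")
```

On the log excerpts in this mail, the only extent common to all three runs is the one starting at sector 628149776; the second out-of-sync block is at a different sector each time.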