Hello,

We have a 200 GB block device (ext3), and I have just discovered a problem: when an online verify (which takes several hours) hits a "Digest integrity check FAILED", the verify simply stops. It never resumes, and it never gets as far as executing "/sbin/drbdadm out-of-sync minor-0".

Yesterday the "slave" server (the DRBD secondary) crashed (the RAID lost its sync when the fire alarm was triggered ...). I reinstalled the machine with exactly the same partition size. On the first connect DRBD resynced, but it did not notice the many "holes" in the slave's block device: it synced only about 10 GB out of 200 GB. So I ran an online verify and found many blocks out of sync:

----------------
[...]
block drbd0: Out of sync: start=115060248, size=1080 (sectors)
block drbd0: Out of sync: start=115315200, size=5536 (sectors)
block drbd0: Out of sync: start=115320736, size=2672 (sectors)
block drbd0: Out of sync: start=115577280, size=8208 (sectors)
block drbd0: Out of sync: start=115839360, size=96 (sectors)
block drbd0: Out of sync: start=115845832, size=1736 (sectors)
block drbd0: Out of sync: start=116108120, size=1528 (sectors)
block drbd0: Out of sync: start=116363808, size=328 (sectors)
block drbd0: Out of sync: start=116370264, size=168 (sectors)
block drbd0: Out of sync: start=116370432, size=1296 (sectors)
block drbd0: Out of sync: start=116625952, size=328 (sectors)
[...]
----------------

This first check after the reinstall ran to the end:

----------------
block drbd0: Online verify done (total 10354 sec; paused 0 sec; 20276 K/sec)
block drbd0: Online verify found 1051516 4k block out of sync!
----------------

At this point I should have done a disconnect/connect myself, but it happened automatically because of a "Digest integrity check FAILED" (I was not at the office that evening ...):

---------------
block drbd0: conn( VerifyS -> Connected )
block drbd0: Writing the whole bitmap, due to failed kmalloc
block drbd0: helper command: /sbin/drbdadm out-of-sync minor-0
block drbd0: helper command: /sbin/drbdadm out-of-sync minor-0 exit code 0 (0x0)
block drbd0: 4107 MB (1051516 bits) marked out-of-sync by on disk bit-map.
block drbd0: Digest integrity check FAILED.
block drbd0: error receiving Data, l: 4124!
block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
block drbd0: asender terminated
block drbd0: Terminating asender thread
block drbd0: Connection closed
block drbd0: conn( ProtocolError -> Unconnected )
block drbd0: receiver terminated
block drbd0: Restarting receiver thread
[...]
---------------

So in the end the secondary became UpToDate after transferring about 10 + 4 = 14 GB out of 200 GB. No problem so far.

I added a cron job that starts an online verify automatically every week, and I set the "out-of-sync" handler to send mail when blocks are not in sync. To check that the mail works, I stopped DRBD on the secondary node, wrote 1 MB (with "dd") at the beginning of the physical block device, and started DRBD again. It resynced to pick up the latest changes:

-------------
block drbd0: Began resync as SyncTarget (will sync 2276 KB [569 bits set]).
block drbd0: Resync done (total 1 sec; paused 0 sec; 2276 K/sec)
block drbd0: 0 % had equal check sums, eliminated: 8K; transferred 2268K total 2276K
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
-------------

Fine: it synced the data that had not yet been received from the master, but it probably did not notice the megabyte I overwrote at the beginning ...
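The sizes in the log lines above can be cross-checked with a little arithmetic. The following is only a sketch: it takes the 4 KiB-per-bit figure from the "1051516 4k block" message, and it assumes that the "sectors" in the "Out of sync" lines are 512-byte sectors. The function names are made up for this illustration.

```python
import re

BITMAP_BLOCK_BYTES = 4096  # one bitmap bit per "4k block", as the log states
SECTOR_BYTES = 512         # assumption: log ranges are in 512-byte sectors

# Matches the "Out of sync" lines exactly as they appear in the kernel log.
OUT_OF_SYNC = re.compile(r"Out of sync: start=(\d+), size=(\d+) \(sectors\)")


def bits_to_kib(bits):
    """Convert a number of set bitmap bits to KiB."""
    return bits * BITMAP_BLOCK_BYTES // 1024


def out_of_sync_kib(log_text):
    """Sum the sizes of all 'Out of sync' ranges in log_text, in KiB."""
    sectors = sum(int(size) for _start, size in OUT_OF_SYNC.findall(log_text))
    return sectors * SECTOR_BYTES // 1024


if __name__ == "__main__":
    # "will sync 2276 KB [569 bits set]"
    print(bits_to_kib(569))
    # "4107 MB (1051516 bits) marked out-of-sync"
    print(bits_to_kib(1051516) // 1024)
    # First two ranges of the verify output quoted above.
    sample = (
        "block drbd0: Out of sync: start=115060248, size=1080 (sectors)\n"
        "block drbd0: Out of sync: start=115315200, size=5536 (sectors)\n"
    )
    print(out_of_sync_kib(sample))
```

With these assumptions, 569 bits correspond to the 2276 KB reported by the resync, and 1051516 bits to the 4107 MB reported after the first verify.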
So I ran an online verify:

--------------
block drbd0: conn( Connected -> VerifyS )
block drbd0: Starting Online Verify from sector 0
block drbd0: Out of sync: start=16, size=88 (sectors)
block drbd0: Out of sync: start=1856, size=144 (sectors)
block drbd0: Out of sync: start=832, size=1024 (sectors)
block drbd0: Digest integrity check FAILED.
block drbd0: error receiving Data, l: 4124!
block drbd0: peer( Primary -> Unknown ) conn( VerifyS -> ProtocolError ) pdsk( UpToDate -> DUnknown )
block drbd0: Online Verify reached sector 410987376
--------------

The three "Out of sync" lines are probably the megabyte I overwrote ... but because of the "Digest integrity check FAILED", the verify does not go on to check the rest of the disk for synchronization problems. It stops, and nothing is resumed afterwards:

-------------
block drbd0: asender terminated
block drbd0: Terminating asender thread
block drbd0: Connection closed
block drbd0: conn( ProtocolError -> Unconnected )
block drbd0: receiver terminated
block drbd0: Restarting receiver thread
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 94
block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [10746])
block drbd0: data-integrity-alg: crc32c
block drbd0: drbd_sync_handshake:
block drbd0: self 78E156BE5056B6C0:0000000000000000:8FEE5C54952F6D3A:93BFAFAE999F4FB5 bits:236 flags:0
block drbd0: peer B43A6F45607420FF:78E156BE5056B6C1:8FEE5C54952F6D3B:93BFAFAE999F4FB5 bits:1322 flags:0
block drbd0: uuid_compare()=-1 by rule 50
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 240(1), total 240; compression: 100.0%
block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 240(1), total 240; compression: 100.0%
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
block drbd0: Began resync as SyncTarget (will sync 5292 KB [1323 bits set]).
block drbd0: Resync done (total 1 sec; paused 0 sec; 5292 K/sec)
block drbd0: 5 % had equal check sums, eliminated: 316K; transferred 4976K total 5292K
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd0: Writing the whole bitmap, due to failed kmalloc
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
block drbd0: Digest integrity check FAILED.
block drbd0: error receiving Data, l: 4124!
block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
block drbd0: asender terminated
block drbd0: Terminating asender thread
block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd0: Connection closed
block drbd0: conn( ProtocolError -> Unconnected )
block drbd0: receiver terminated
block drbd0: Restarting receiver thread
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 94
block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [10746])
block drbd0: data-integrity-alg: crc32c
block drbd0: drbd_sync_handshake:
block drbd0: self B43A6F45607420FE:0000000000000000:EF6F5082B813AA14:78E156BE5056B6C1 bits:0 flags:0
block drbd0: peer 6A5625F0DA1110E7:B43A6F45607420FF:EF6F5082B813AA15:78E156BE5056B6C1 bits:33 flags:0
block drbd0: uuid_compare()=-1 by rule 50
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 45(1), total 45; compression: 100.0%
block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 45(1), total 45; compression: 100.0%
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
block drbd0: Began resync as SyncTarget (will sync 132 KB [33 bits set]).
block drbd0: Resync done (total 1 sec; paused 0 sec; 132 K/sec)
block drbd0: 21 % had equal check sums, eliminated: 28K; transferred 104K total 132K
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
-------------

So, because of the digest failure, "/sbin/drbdadm out-of-sync minor-0" is never called, and no mail is sent to the admins to tell them that some blocks were out of sync ...

My problem is that I have scheduled only one check per week, and if it cannot do its job properly, that is not good :(

- I will not be warned about out-of-sync problems
- I can't be sure that DRBD has checked the whole partition
- I can't be sure that all my data is correctly UpToDate (replicated)

So this is critical for me :-/

What can I do to avoid this? How can I be sure that the new DRBD slave (freshly reinstalled) has no corrupted data?

Here is some information about our systems:

- RHEL 5.5
- CentOS Extra repository

The slave:

--------
drbd: initialized. Version: 8.3.8 (api:88/proto:86-94)
drbd: GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild at builder10.centos.org, 2010-06-04 08:04:09
drbd: registered as block device major 147
drbd: minor_table @ 0xffff81027c0625c0
--------

The master:

--------
drbd: initialized. Version: 8.3.8 (api:88/proto:86-94)
drbd: GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild at builder10.centos.org, 2010-06-04 08:04:09
drbd: registered as block device major 147
drbd: minor_table @ 0xffff8102b91af3c0
--------

Configuration:

------------
global {
    #don't send statistics through internet ...
    usage-count no;
}

common {
    protocol C;

    syncer {
        #replication speed
        rate 20M;

        #use compression for bitmap exchange
        use-rle;

        #set the "on-line device verification" algorithm (should be triggered by a cronjob)
        verify-alg md5;

        #set the "checksum-based synchronization" algorithm (used when synchronizing)
        csums-alg crc32c;

        #tuning the activity log size
        #default is 127 ; increase it for intensive I/O (writing lots of small files)
        al-extents 3389;
    }
}

resource data-integration {
    device /dev/drbd0;
    meta-disk internal;

    handlers {
        #send mail for these events
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh root";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
        pri-lost "/usr/lib/drbd/notify-pri-lost.sh root";
        local-io-error "/usr/lib/drbd/notify-io-error.sh root";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";

        #no notification script for these handlers or don't want to work ...
        #before-resync-target "/usr/lib/drbd/";
        #after-resync-target "/usr/lib/drbd/";
        #initial-split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }

    disk {
        #use "none" for write-after-write (because we've got a battery for our server)
        no-disk-barrier;
        no-disk-flushes;
        no-md-flushes;

        #should NOT be used ? only under special circumstances ? But, we don't need to write in order ... so disable it
        no-disk-drain;

        #on I/O error, detach disk and use the remote peer disk
        on-io-error detach;
    }

    net {
        #authentication
        cram-hmac-alg sha1;
        shared-secret "FDQrqdsfe456GFfgssdrf34";

        #2x primary to use GFS ("rw/ro" or "rw/rw")
        #this is not needed if you want to use "ext3/ext4" filesystem ("rw/--" only)
        ##allow-two-primaries;

        #set the "replication traffic integrity checking" algorithm (used when replicating)
        data-integrity-alg crc32c;

        #split-brain (node = secondary/primary/both-primary ; discard if no change/disconnect/disconnect)
        #do nothing => disconnect
        after-sb-0pri discard-zero-changes;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;

        #tuning recommendations (for RAID controller)
        max-buffers 8000;
        max-epoch-size 8000;
    }

    startup {
        #don't wait infinitely (causes a hang on boot if not set)
        wfc-timeout 15;
        degr-wfc-timeout 15;

        #when starting DRBD service, set one node to "primary" (<node_name> or "both")
        become-primary-on <machine_name>;
    }

    on <machine_name> {
        address <IP>:7788;
        disk /dev/sda5;
    }

    on <machine_name> {
        address <IP>:7788;
        disk /dev/sda5;
    }
}
------------

In fact, I am very worried about possibly corrupted replication data on the master ... How can I start a new replication from the beginning and be sure to get data that is really not corrupted, and not just an "UpToDate" state?

Thank you :)

--
View this message in context: http://old.nabble.com/DRBD-Online-Verify-stop-after-a-digest-integrity-check-failed-tp29471512p29471512.html
Sent from the DRBD - User mailing list archive at Nabble.com.
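As a stopgap for the missing notification, the weekly cron job could itself check whether the last verify actually finished. The sketch below is not a DRBD feature and only illustrates the idea. It scans kernel log lines for the messages quoted in this post ("Starting Online Verify", "Online verify done", "Digest integrity check FAILED", "ProtocolError") and reports whether the most recent verify run completed or was cut short. The function name and the state names are hypothetical, and how the log lines are obtained is left open.

```python
import re

# Message fragments copied from the kernel log excerpts in this post.
START = "Starting Online Verify"
DONE = "Online verify done"
ABORT_MARKERS = ("Digest integrity check FAILED", "ProtocolError")
OUT_OF_SYNC = re.compile(r"Out of sync: start=(\d+), size=(\d+) \(sectors\)")


def verify_status(lines):
    """Classify the most recent online verify found in lines.

    Returns (state, sectors), where state is one of "not-started",
    "running", "done" or "aborted", and sectors is the number of
    out-of-sync sectors logged during that run.
    """
    state = "not-started"
    sectors = 0
    for line in lines:
        if START in line:
            # A new verify run begins: forget any earlier run.
            state, sectors = "running", 0
        elif state == "running":
            match = OUT_OF_SYNC.search(line)
            if match:
                sectors += int(match.group(2))
            elif DONE in line:
                state = "done"
            elif any(marker in line for marker in ABORT_MARKERS):
                state = "aborted"
    return state, sectors


if __name__ == "__main__":
    aborted_run = [
        "block drbd0: Starting Online Verify from sector 0",
        "block drbd0: Out of sync: start=16, size=88 (sectors)",
        "block drbd0: Out of sync: start=1856, size=144 (sectors)",
        "block drbd0: Out of sync: start=832, size=1024 (sectors)",
        "block drbd0: Digest integrity check FAILED.",
    ]
    print(verify_status(aborted_run))
```

A wrapper around this could send mail whenever the state is anything other than "done", or whenever the sector count is non-zero, so that an interrupted verify no longer goes unnoticed.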