[DRBD-user] DRBD Online Verify stop after a digest integrity check failed

Wed Aug 18 15:50:31 CEST 2010

Hello, 

We have a 200 GB block device (ext3), and I have just discovered something
worrying: when an online verify (which takes several hours) encounters a
"Digest integrity check FAILED", it simply stops verifying and never resumes
the check, so it never gets as far as executing
"/sbin/drbdadm out-of-sync minor-0".

Yesterday, the "slave" server (secondary for DRBD) crashed (the RAID lost
its sync because the "fire alarm" was triggered ...). So I reinstalled the
machine with exactly the same partition sizes.

On the first connect, DRBD resynced ... (and it did not see many "holes" in
the block device on the "slave"). It synced only about 10 GB out of 200 GB.
So I ran an online verify and found many blocks "out-of-sync" :
----------------
[...]
block drbd0: Out of sync: start=115060248, size=1080 (sectors)
block drbd0: Out of sync: start=115315200, size=5536 (sectors)
block drbd0: Out of sync: start=115320736, size=2672 (sectors)
block drbd0: Out of sync: start=115577280, size=8208 (sectors)
block drbd0: Out of sync: start=115839360, size=96 (sectors)
block drbd0: Out of sync: start=115845832, size=1736 (sectors)
block drbd0: Out of sync: start=116108120, size=1528 (sectors)
block drbd0: Out of sync: start=116363808, size=328 (sectors)
block drbd0: Out of sync: start=116370264, size=168 (sectors)
block drbd0: Out of sync: start=116370432, size=1296 (sectors)
block drbd0: Out of sync: start=116625952, size=328 (sectors)
[...]
----------------

This first check, after the reinstall, ran to the end :
----------------
block drbd0: Online verify  done (total 10354 sec; paused 0 sec; 20276 K/sec)
block drbd0: Online verify found 1051516 4k block out of sync!
----------------

So I should have disconnected and reconnected (but DRBD did it automatically
because of an "integrity check failed"; I was not at the office that
evening ...) :
---------------
block drbd0: conn( VerifyS -> Connected )
block drbd0: Writing the whole bitmap, due to failed kmalloc
block drbd0: helper command: /sbin/drbdadm out-of-sync minor-0
block drbd0: helper command: /sbin/drbdadm out-of-sync minor-0 exit code 0 (0x0)
block drbd0: 4107 MB (1051516 bits) marked out-of-sync by on disk bit-map.
block drbd0: Digest integrity check FAILED.
block drbd0: error receiving Data, l: 4124!
block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
block drbd0: asender terminated
block drbd0: Terminating asender thread
block drbd0: Connection closed
block drbd0: conn( ProtocolError -> Unconnected )
block drbd0: receiver terminated
block drbd0: Restarting receiver thread
[...]
---------------

So, in the end, the DRBD secondary became UpToDate after transferring about
10 + 4 GB = 14 GB out of 200 GB.
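
As a quick sanity check on the figures in the logs above, the sketch below
converts the 1051516 out-of-sync bits to megabytes and multiplies the duration
of the complete verify run by its reported rate. It assumes that one bitmap bit
covers 4 KiB (as the "4k block" wording in the log suggests) and that "K/sec"
means KiB per second; the function names are only illustrative.

```python
# Sanity-check the numbers reported in the kernel log.
# Assumption: one bitmap bit stands for one 4 KiB block ("4k block" in the log).
KIB_PER_BIT = 4


def bits_to_mib(bits: int) -> int:
    """Convert a count of out-of-sync bitmap bits to whole MiB."""
    return bits * KIB_PER_BIT // 1024


def verified_gib(seconds: int, kib_per_sec: int) -> float:
    """Amount of data covered by a verify run, in GiB (assumes KiB/s)."""
    return seconds * kib_per_sec / (1024 * 1024)


if __name__ == "__main__":
    print(bits_to_mib(1051516))                  # 4107
    print(round(verified_gib(10354, 20276), 1))  # 200.2
```

Both results agree with the log: 4107 MB marked out-of-sync, and roughly
200 GB covered by the verify run that reached the end.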

Up to this point, I had no problem.

I added a cron job to start an online verify automatically every week, and I
set the "out-of-sync" handler to send mail when blocks are found out of sync.
To check that the mail works, I stopped DRBD on the secondary node, wrote
1 MB (with "dd") at the beginning of the physical block device and started
DRBD again => it synced to get the latest changes :
-------------
block drbd0: Began resync as SyncTarget (will sync 2276 KB [569 bits set]).
block drbd0: Resync done (total 1 sec; paused 0 sec; 2276 K/sec)
block drbd0: 0 % had equal check sums, eliminated: 8K; transferred 2268K total 2276K
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
-------------

Fine, it synced the data that had not yet been received from the master ...
but it probably did not notice the megabyte at the beginning ... So I ran an
online verify :
--------------
block drbd0: conn( Connected -> VerifyS )
block drbd0: Starting Online Verify from sector 0
block drbd0: Out of sync: start=16, size=88 (sectors)
block drbd0: Out of sync: start=1856, size=144 (sectors)
block drbd0: Out of sync: start=832, size=1024 (sectors)
block drbd0: Digest integrity check FAILED.
block drbd0: error receiving Data, l: 4124!
block drbd0: peer( Primary -> Unknown ) conn( VerifyS -> ProtocolError ) pdsk( UpToDate -> DUnknown )
block drbd0: Online Verify reached sector 410987376
--------------
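
To see how much data the three "Out of sync" ranges reported just above
cover, they can be added up with a small sketch like the following. It
assumes that the sizes are given in 512-byte sectors; the regular expression
and the function name are only illustrative.

```python
import re

# Matches lines such as:
#   block drbd0: Out of sync: start=16, size=88 (sectors)
OOS_RE = re.compile(r"Out of sync: start=(\d+), size=(\d+) \(sectors\)")

SECTOR_BYTES = 512  # assumption: sizes are reported in 512-byte sectors


def out_of_sync_kib(log_text: str) -> float:
    """Sum the sizes of all reported out-of-sync ranges, in KiB."""
    sectors = sum(int(m.group(2)) for m in OOS_RE.finditer(log_text))
    return sectors * SECTOR_BYTES / 1024


LOG = """\
block drbd0: Out of sync: start=16, size=88 (sectors)
block drbd0: Out of sync: start=1856, size=144 (sectors)
block drbd0: Out of sync: start=832, size=1024 (sectors)
"""

if __name__ == "__main__":
    print(out_of_sync_kib(LOG))  # 628.0
```

That is 628 KiB in total, and all three ranges end at or before sector 2000,
i.e. inside the first megabyte of the device.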

The 3 "Out of sync" lines are probably the megabyte that was not synced ...
but it does not check the rest of the disk for synchronization problems
because of the "Digest integrity check FAILED" ... It stops and the verify is
never resumed :
-------------
block drbd0: asender terminated
block drbd0: Terminating asender thread
block drbd0: Connection closed
block drbd0: conn( ProtocolError -> Unconnected )
block drbd0: receiver terminated
block drbd0: Restarting receiver thread
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 94
block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [10746])
block drbd0: data-integrity-alg: crc32c
block drbd0: drbd_sync_handshake:
block drbd0: self 78E156BE5056B6C0:0000000000000000:8FEE5C54952F6D3A:93BFAFAE999F4FB5 bits:236 flags:0
block drbd0: peer B43A6F45607420FF:78E156BE5056B6C1:8FEE5C54952F6D3B:93BFAFAE999F4FB5 bits:1322 flags:0
block drbd0: uuid_compare()=-1 by rule 50
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 240(1), total 240; compression: 100.0%
block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 240(1), total 240; compression: 100.0%
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
block drbd0: Began resync as SyncTarget (will sync 5292 KB [1323 bits set]).
block drbd0: Resync done (total 1 sec; paused 0 sec; 5292 K/sec)
block drbd0: 5 % had equal check sums, eliminated: 316K; transferred 4976K total 5292K
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd0: Writing the whole bitmap, due to failed kmalloc
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
block drbd0: Digest integrity check FAILED.
block drbd0: error receiving Data, l: 4124!
block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
block drbd0: asender terminated
block drbd0: Terminating asender thread
block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd0: Connection closed
block drbd0: conn( ProtocolError -> Unconnected )
block drbd0: receiver terminated
block drbd0: Restarting receiver thread
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 94
block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [10746])
block drbd0: data-integrity-alg: crc32c
block drbd0: drbd_sync_handshake:
block drbd0: self B43A6F45607420FE:0000000000000000:EF6F5082B813AA14:78E156BE5056B6C1 bits:0 flags:0
block drbd0: peer 6A5625F0DA1110E7:B43A6F45607420FF:EF6F5082B813AA15:78E156BE5056B6C1 bits:33 flags:0
block drbd0: uuid_compare()=-1 by rule 50
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 45(1), total 45; compression: 100.0%
block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 45(1), total 45; compression: 100.0%
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
block drbd0: Began resync as SyncTarget (will sync 132 KB [33 bits set]).
block drbd0: Resync done (total 1 sec; paused 0 sec; 132 K/sec)
block drbd0: 21 % had equal check sums, eliminated: 28K; transferred 104K total 132K
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
-------------

So, because the digest check fails, it never calls "/sbin/drbdadm
out-of-sync minor-0" and no mail is sent to the admins to tell them that
some blocks were "out-of-sync" ...

My problem is that I scheduled only 1 check per week, and if it can't do its
job correctly, that's not good :( :
- I will not be warned about "out-of-sync" problems
- I can't be sure that DRBD has checked the whole partition
- I can't be sure that all my data is correctly UpToDate (replicated)
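
One way to at least notice an interrupted run would be to scan the kernel log
for the verify messages quoted in this message. The following is only a
sketch: the marker strings are copied from the 8.3.8 log lines above, and
that they stay the same in other DRBD versions is an assumption, as is the
function name.

```python
# Marker strings copied from the kernel log lines quoted in this message.
# Assumption: other DRBD versions may word these messages differently.
START = "Starting Online Verify from sector"
DONE = "Online verify done"
ABORTED = "Online Verify reached sector"


def last_verify_status(log_text: str) -> str:
    """Return 'done', 'aborted', 'running' or 'none' for the last verify run."""
    status = "none"
    for raw in log_text.splitlines():
        line = " ".join(raw.split())  # collapse runs of spaces ("verify  done")
        if START in line:
            status = "running"
        elif DONE in line:
            status = "done"
        elif ABORTED in line:
            status = "aborted"
    return status


SAMPLE = """\
block drbd0: conn( Connected -> VerifyS )
block drbd0: Starting Online Verify from sector 0
block drbd0: Digest integrity check FAILED.
block drbd0: Online Verify reached sector 410987376
"""

if __name__ == "__main__":
    print(last_verify_status(SAMPLE))  # aborted
```

A cron job could run such a check some time after the weekly verify and send
mail whenever the status is anything other than "done".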

So, this is critical for me :-/

What can I do to avoid this? How can I be sure that the new DRBD slave
(freshly reinstalled) has no corrupted data?

Here is some information about our systems :
- RHEL 5.5
- CentOS Extra repository

The slave :
--------
drbd: initialized. Version: 8.3.8 (api:88/proto:86-94)
drbd: GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild at builder10.centos.org, 2010-06-04 08:04:09
drbd: registered as block device major 147
drbd: minor_table @ 0xffff81027c0625c0
--------

The master :
--------
drbd: initialized. Version: 8.3.8 (api:88/proto:86-94)
drbd: GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild at builder10.centos.org, 2010-06-04 08:04:09
drbd: registered as block device major 147
drbd: minor_table @ 0xffff8102b91af3c0
--------

Configuration :
------------
global {
        #don't send statistics through internet ...
        usage-count no;
}

common {
        protocol C;

        syncer {
                #replication speed
                rate 20M;

                #use compression for bitmap exchange
                use-rle;

                #set the "on-line device verification" algorithm (should be triggered by a cronjob)
                verify-alg md5;

                #set the "checksum-based synchronization" algorithm (used when synchronizing)
                csums-alg crc32c;

                #tuning the activity log size
                #default is 127 ; increment it when using intensive I/O (write lots of small files)
                al-extents 3389;
        }
}

resource data-integration {
        device /dev/drbd0;
        meta-disk internal;

        handlers {
                #send mail for these events
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh root";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
                pri-lost "/usr/lib/drbd/notify-pri-lost.sh root";
                local-io-error "/usr/lib/drbd/notify-io-error.sh root";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";

                #no notification script for these handlers or don't want to work ...
                #before-resync-target "/usr/lib/drbd/";
                #after-resync-target "/usr/lib/drbd/";
                #initial-split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        }

        disk {
                #use "none" for write-after-write (because, we've got battery for our server)
                no-disk-barrier;
                no-disk-flushes;
                no-md-flushes;
                #should NOT be used ? only under special circumstances ? But, we don't need to write in order ... so disable it
                no-disk-drain;

                #on I/O error, detach disk and use the remote peer disk
                on-io-error detach;
        }

        net {
                #authentication
                cram-hmac-alg sha1;
                shared-secret "FDQrqdsfe456GFfgssdrf34";

                #2x primary to use GFS ("rw/ro" or "rw/rw")
                #this is not needed if you want to use "ext3/ext4" filesystem ("rw/--" only)
                ##allow-two-primaries;

                #set the "replication traffic integrity checking" algorithm (used when replicating)
                data-integrity-alg crc32c;

                #split-brain (node = secondary/primary/both-primary ; discard if no change/disconnect/disconnect)
                #do nothing => disconnect
                after-sb-0pri discard-zero-changes;
                after-sb-1pri disconnect;
                after-sb-2pri disconnect;

                #tuning recommendations (for RAID controller)
                max-buffers 8000;
                max-epoch-size 8000;
        }

        startup {
                #don't wait infinitely (causes a hang on boot if not set)
                wfc-timeout 15;
                degr-wfc-timeout 15;

                #when starting DRBD service, set one node to "primary" (<node_name> or "both")
                become-primary-on <machine_name>;
        }

        on <machine_name> {
                address         <IP>:7788;
                disk            /dev/sda5;
        }
        on <machine_name> {
                address         <IP>:7788;
                disk            /dev/sda5;
        }
}
------------

In fact, I'm very worried about possibly corrupted replication data on
the master ... How can I start a new replication from scratch and be
sure to get data that really is not corrupted, and not just an "UpToDate"
state .. ?

Thank you :)