[DRBD-user] tons of out-of-sync sectors detected

Eric Marin eric.marin at utc.fr
Wed Jul 30 15:35:08 CEST 2008


Hello,

As a follow-up to the mails I wrote to this list on 2008.06.23 ["drbdadm verify all seems to produce
false positives on ext3 and crash the server"] and 2008.07.09, in which I reported problems with the
command 'drbdadm verify all' (it regularly found out-of-sync sectors for no apparent reason),
today I finally decided to enable integrity checking:
net { data-integrity-alg "crc32c"; }

Then, looking in /var/log/kern.log, I noticed many messages like these:
---------------------------------------------------------------------
Jul 30 14:37:53 ldap-a kernel: [779687.127945] drbd0: meta connection shut down by peer.
Jul 30 14:37:53 ldap-a kernel: [779687.127980] drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jul 30 14:37:53 ldap-a kernel: [779687.128040] drbd0: asender terminated
Jul 30 14:37:53 ldap-a kernel: [779687.128064] drbd0: Terminating asender thread
Jul 30 14:37:53 ldap-a kernel: [779687.128108] drbd0: Creating new current UUID
Jul 30 14:37:53 ldap-a kernel: [779687.128147] drbd0: Writing meta data super block now.
Jul 30 14:37:53 ldap-a kernel: [779687.128196] drbd0: sock was reset by peer
Jul 30 14:37:53 ldap-a kernel: [779687.128223] drbd0: short read expecting header on sock: r=-104
Jul 30 14:37:53 ldap-a kernel: [779687.128269] drbd0: tl_clear()
Jul 30 14:37:53 ldap-a kernel: [779687.128425] drbd0: Connection closed
Jul 30 14:37:53 ldap-a kernel: [779687.128452] drbd0: helper command: /sbin/drbdadm outdate-peer
Jul 30 14:37:55 ldap-a kernel: [779688.330184] drbd0: outdate-peer helper returned 4 (peer is outdated)
Jul 30 14:37:55 ldap-a kernel: [779688.330224] drbd0: pdsk( DUnknown -> Outdated )
Jul 30 14:37:55 ldap-a kernel: [779688.330262] drbd0: Writing meta data super block now.
Jul 30 14:37:55 ldap-a kernel: [779688.330360] drbd0: conn( NetworkFailure -> Unconnected )
Jul 30 14:37:55 ldap-a kernel: [779688.330390] drbd0: receiver terminated
Jul 30 14:37:55 ldap-a kernel: [779688.330418] drbd0: receiver (re)started
Jul 30 14:37:55 ldap-a kernel: [779688.330445] drbd0: conn( Unconnected -> WFConnection )
Jul 30 14:37:55 ldap-a kernel: [779688.429930] drbd0: Handshake successful: Agreed network protocol version 88
Jul 30 14:37:55 ldap-a kernel: [779688.472427] drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jul 30 14:37:55 ldap-a kernel: [779688.472458] drbd0: conn( WFConnection -> WFReportParams )
Jul 30 14:37:55 ldap-a kernel: [779688.472493] drbd0: Starting asender thread (from drbd0_receiver [23077])
Jul 30 14:37:55 ldap-a kernel: [779688.472566] drbd0: data-integrity-alg: crc32c
Jul 30 14:37:55 ldap-a kernel: [779688.473066] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Jul 30 14:37:55 ldap-a kernel: [779688.473115] drbd0: Writing meta data super block now.
Jul 30 14:37:55 ldap-a kernel: [779688.479730] drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
Jul 30 14:37:55 ldap-a kernel: [779688.479781] drbd1: aftr_isp( 0 -> 1 )
Jul 30 14:37:55 ldap-a kernel: [779688.479808] drbd0: Began resync as SyncSource (will sync 896 KB [224 bits set]).
Jul 30 14:37:55 ldap-a kernel: [779688.479855] drbd0: Writing meta data super block now.
Jul 30 14:37:55 ldap-a kernel: [779688.483240] drbd1: peer_isp( 0 -> 1 )
Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd0: Resync done (total 1 sec; paused 0 sec; 896 K/sec)
Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd1: aftr_isp( 1 -> 0 )
Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd0: Writing meta data super block now.
Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd1: peer_isp( 1 -> 0 )
Jul 30 14:38:59 ldap-a kernel: [779728.867597] drbd0: sock_sendmsg returned -104
Jul 30 14:38:59 ldap-a kernel: [779728.867632] drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Jul 30 14:38:59 ldap-a kernel: [779728.867799] drbd0: Creating new current UUID
Jul 30 14:38:59 ldap-a kernel: [779728.867841] drbd0: Writing meta data super block now.
Jul 30 14:38:59 ldap-a kernel: [779728.867912] drbd0: meta connection shut down by peer.
Jul 30 14:38:59 ldap-a kernel: [779728.867942] drbd0: asender terminated
Jul 30 14:38:59 ldap-a kernel: [779728.867967] drbd0: Terminating asender thread
Jul 30 14:38:59 ldap-a kernel: [779728.870276] drbd0: sock was shut down by peer
Jul 30 14:38:59 ldap-a kernel: [779728.870305] drbd0: short read expecting header on sock: r=0
Jul 30 14:38:59 ldap-a kernel: [779728.870356] drbd0: tl_clear()
Jul 30 14:38:59 ldap-a kernel: [779728.870466] drbd0: Connection closed
Jul 30 14:38:59 ldap-a kernel: [779728.870491] drbd0: helper command: /sbin/drbdadm outdate-peer
Jul 30 14:39:01 ldap-a kernel: [779729.521649] drbd0: outdate-peer helper returned 4 (peer is outdated)
Jul 30 14:39:01 ldap-a kernel: [779729.521690] drbd0: pdsk( DUnknown -> Outdated )
Jul 30 14:39:01 ldap-a kernel: [779729.521755] drbd0: Writing meta data super block now.
Jul 30 14:39:01 ldap-a kernel: [779729.521849] drbd0: conn( BrokenPipe -> Unconnected )
Jul 30 14:39:01 ldap-a kernel: [779729.521878] drbd0: receiver terminated
Jul 30 14:39:01 ldap-a kernel: [779729.521903] drbd0: receiver (re)started
Jul 30 14:39:01 ldap-a kernel: [779729.521930] drbd0: conn( Unconnected -> WFConnection )
Jul 30 14:39:01 ldap-a kernel: [779729.558903] drbd0: Handshake successful: Agreed network protocol version 88
Jul 30 14:39:01 ldap-a kernel: [779729.563009] drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jul 30 14:39:01 ldap-a kernel: [779729.563047] drbd0: conn( WFConnection -> WFReportParams )
Jul 30 14:39:01 ldap-a kernel: [779729.563088] drbd0: Starting asender thread (from drbd0_receiver [23077])
Jul 30 14:39:01 ldap-a kernel: [779729.563158] drbd0: data-integrity-alg: crc32c
Jul 30 14:39:01 ldap-a kernel: [779729.563656] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Jul 30 14:39:01 ldap-a kernel: [779729.563705] drbd0: Writing meta data super block now.
Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd1: aftr_isp( 0 -> 1 )
Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd0: Began resync as SyncSource (will sync 1936 KB [484 bits set]).
Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd0: Writing meta data super block now.
Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd1: peer_isp( 0 -> 1 )
Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd0: Resync done (total 1 sec; paused 0 sec; 1936 K/sec)
Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd1: aftr_isp( 1 -> 0 )
Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd0: Writing meta data super block now.
Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd1: peer_isp( 1 -> 0 )
Jul 30 14:41:04 ldap-a kernel: [779796.779854] drbd0: meta connection shut down by peer.
(...)
---------------------------------------------------------------------

It seems data-integrity-alg does its job: it notices out-of-sync errors and immediately
resynchronizes the two nodes. However, I get LOTS of these messages (I'd say one every two minutes
on average). So I also disabled offloading on the network card (on both nodes):
# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
# ethtool -K eth1 rx off tx off
# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

But I still get these messages!
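
To compare how often the connection drops before and after a change like this, the events can be
counted from the log instead of estimated by eye. Below is a minimal Python sketch, not a tested
tool: the function name `summarize` and its `device` and `year` parameters are my own assumptions
(syslog timestamps carry no year, so one has to be supplied), and the matched strings are simply
the ones visible in the excerpt above.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
# Jul 30 14:37:53 ldap-a kernel: [779687.127980] drbd0: peer( ... ) conn( Connected -> ...
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+\[\s*[\d.]+\]\s+"
    r"(?P<dev>drbd\d+):\s+(?P<msg>.*)$"
)
# Matches the resync size in "Began resync as SyncSource (will sync 896 KB ..."
RESYNC_RE = re.compile(r"Began resync as \w+ \(will sync (\d+) KB")


def summarize(lines, device="drbd0", year=2008):
    """Return (disconnect_times, resync_kb) for one DRBD device.

    disconnect_times: datetimes at which the connection state left "Connected".
    resync_kb: sizes (in KB) announced by each "Began resync" message.
    """
    disconnects, resync_kb = [], []
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("dev") != device:
            continue
        msg = m.group("msg")
        # A drop is logged as a state change starting with "conn( Connected".
        # "conn( SyncSource -> Connected )" does not contain that substring.
        if "conn( Connected" in msg:
            ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            disconnects.append(ts)
        r = RESYNC_RE.search(msg)
        if r:
            resync_kb.append(int(r.group(1)))
    return disconnects, resync_kb


def intervals_seconds(times):
    """Seconds elapsed between consecutive disconnects."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Calling `summarize(open("/var/log/kern.log"))` and passing the first result to `intervals_seconds`
would give the gaps between drops, which can then be compared across configurations (with and
without offloading, for instance). The month names are parsed with `%b`, so this assumes an English
or C locale.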

I updated the kernel to version "linux-image-2.6.25-2-686-bigmem   2.6.25-6~bpo40+1" on
2008.07.09 and haven't _noticed_ any corruption since then, even though DRBD seems to find lots and
lots of errors. I really get the impression these are false positives... but should I worry? What
can I do now?

Thanks for any suggestions!
Eric


