Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello, as a follow-up to mails I wrote to this list on 2008.06.23 ["drbdadm verify all seems to produce false positives on ext3 and crash the server"] and 2008.07.09, where I said I had problems with the command 'drbdadm verify all' (it regularly found out-of-sync sectors without any apparent reason), today I've decided to finally enable integrity checking : net { data-integrity-alg "crc32c"; } Then, looking in /var/log/kern.log, I noticed lots of messages like these : --------------------------------------------------------------------- Jul 30 14:37:53 ldap-a kernel: [779687.127945] drbd0: meta connection shut down by peer. Jul 30 14:37:53 ldap-a kernel: [779687.127980] drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Jul 30 14:37:53 ldap-a kernel: [779687.128040] drbd0: asender terminated Jul 30 14:37:53 ldap-a kernel: [779687.128064] drbd0: Terminating asender thread Jul 30 14:37:53 ldap-a kernel: [779687.128108] drbd0: Creating new current UUID Jul 30 14:37:53 ldap-a kernel: [779687.128147] drbd0: Writing meta data super block now. Jul 30 14:37:53 ldap-a kernel: [779687.128196] drbd0: sock was reset by peer Jul 30 14:37:53 ldap-a kernel: [779687.128223] drbd0: short read expecting header on sock: r=-104 Jul 30 14:37:53 ldap-a kernel: [779687.128269] drbd0: tl_clear() Jul 30 14:37:53 ldap-a kernel: [779687.128425] drbd0: Connection closed Jul 30 14:37:53 ldap-a kernel: [779687.128452] drbd0: helper command: /sbin/drbdadm outdate-peer Jul 30 14:37:55 ldap-a kernel: [779688.330184] drbd0: outdate-peer helper returned 4 (peer is outdated) Jul 30 14:37:55 ldap-a kernel: [779688.330224] drbd0: pdsk( DUnknown -> Outdated ) Jul 30 14:37:55 ldap-a kernel: [779688.330262] drbd0: Writing meta data super block now. Jul 30 14:37:55 ldap-a kernel: [779688.330360] drbd0: conn( NetworkFailure -> Unconnected ) Jul 30 14:37:55 ldap-a kernel: [779688.330390] drbd0: receiver terminated Jul 30 14:37:55 ldap-a kernel: [779688.330418] drbd0: receiver (re)started Jul 30 14:37:55 ldap-a kernel: [779688.330445] drbd0: conn( Unconnected -> WFConnection ) Jul 30 14:37:55 ldap-a kernel: [779688.429930] drbd0: Handshake successful: Agreed network protocol version 88 Jul 30 14:37:55 ldap-a kernel: [779688.472427] drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC Jul 30 14:37:55 ldap-a kernel: [779688.472458] drbd0: conn( WFConnection -> WFReportParams ) Jul 30 14:37:55 ldap-a kernel: [779688.472493] drbd0: Starting asender thread (from drbd0_receiver [23077]) Jul 30 14:37:55 ldap-a kernel: [779688.472566] drbd0: data-integrity-alg: crc32c Jul 30 14:37:55 ldap-a kernel: [779688.473066] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) Jul 30 14:37:55 ldap-a kernel: [779688.473115] drbd0: Writing meta data super block now. Jul 30 14:37:55 ldap-a kernel: [779688.479730] drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent ) Jul 30 14:37:55 ldap-a kernel: [779688.479781] drbd1: aftr_isp( 0 -> 1 ) Jul 30 14:37:55 ldap-a kernel: [779688.479808] drbd0: Began resync as SyncSource (will sync 896 KB [224 bits set]). Jul 30 14:37:55 ldap-a kernel: [779688.479855] drbd0: Writing meta data super block now. Jul 30 14:37:55 ldap-a kernel: [779688.483240] drbd1: peer_isp( 0 -> 1 ) Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd0: Resync done (total 1 sec; paused 0 sec; 896 K/sec) Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd1: aftr_isp( 1 -> 0 ) Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd0: Writing meta data super block now. Jul 30 14:37:56 ldap-a kernel: [779688.488330] drbd1: peer_isp( 1 -> 0 ) Jul 30 14:38:59 ldap-a kernel: [779728.867597] drbd0: sock_sendmsg returned -104 Jul 30 14:38:59 ldap-a kernel: [779728.867632] drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) Jul 30 14:38:59 ldap-a kernel: [779728.867799] drbd0: Creating new current UUID Jul 30 14:38:59 ldap-a kernel: [779728.867841] drbd0: Writing meta data super block now. Jul 30 14:38:59 ldap-a kernel: [779728.867912] drbd0: meta connection shut down by peer. Jul 30 14:38:59 ldap-a kernel: [779728.867942] drbd0: asender terminated Jul 30 14:38:59 ldap-a kernel: [779728.867967] drbd0: Terminating asender thread Jul 30 14:38:59 ldap-a kernel: [779728.870276] drbd0: sock was shut down by peer Jul 30 14:38:59 ldap-a kernel: [779728.870305] drbd0: short read expecting header on sock: r=0 Jul 30 14:38:59 ldap-a kernel: [779728.870356] drbd0: tl_clear() Jul 30 14:38:59 ldap-a kernel: [779728.870466] drbd0: Connection closed Jul 30 14:38:59 ldap-a kernel: [779728.870491] drbd0: helper command: /sbin/drbdadm outdate-peer Jul 30 14:39:01 ldap-a kernel: [779729.521649] drbd0: outdate-peer helper returned 4 (peer is outdated) Jul 30 14:39:01 ldap-a kernel: [779729.521690] drbd0: pdsk( DUnknown -> Outdated ) Jul 30 14:39:01 ldap-a kernel: [779729.521755] drbd0: Writing meta data super block now. Jul 30 14:39:01 ldap-a kernel: [779729.521849] drbd0: conn( BrokenPipe -> Unconnected ) Jul 30 14:39:01 ldap-a kernel: [779729.521878] drbd0: receiver terminated Jul 30 14:39:01 ldap-a kernel: [779729.521903] drbd0: receiver (re)started Jul 30 14:39:01 ldap-a kernel: [779729.521930] drbd0: conn( Unconnected -> WFConnection ) Jul 30 14:39:01 ldap-a kernel: [779729.558903] drbd0: Handshake successful: Agreed network protocol version 88 Jul 30 14:39:01 ldap-a kernel: [779729.563009] drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC Jul 30 14:39:01 ldap-a kernel: [779729.563047] drbd0: conn( WFConnection -> WFReportParams ) Jul 30 14:39:01 ldap-a kernel: [779729.563088] drbd0: Starting asender thread (from drbd0_receiver [23077]) Jul 30 14:39:01 ldap-a kernel: [779729.563158] drbd0: data-integrity-alg: crc32c Jul 30 14:39:01 ldap-a kernel: [779729.563656] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) Jul 30 14:39:01 ldap-a kernel: [779729.563705] drbd0: Writing meta data super block now. Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent ) Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd1: aftr_isp( 0 -> 1 ) Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd0: Began resync as SyncSource (will sync 1936 KB [484 bits set]). Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd0: Writing meta data super block now. Jul 30 14:39:01 ldap-a kernel: [779729.566904] drbd1: peer_isp( 0 -> 1 ) Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd0: Resync done (total 1 sec; paused 0 sec; 1936 K/sec) Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd1: aftr_isp( 1 -> 0 ) Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd0: Writing meta data super block now. Jul 30 14:39:01 ldap-a kernel: [779729.578905] drbd1: peer_isp( 1 -> 0 ) Jul 30 14:41:04 ldap-a kernel: [779796.779854] drbd0: meta connection shut down by peer. (...) --------------------------------------------------------------------- It seems data-integrity-alg does its job and notices out-of-sync errors, and proceeds to immediately resynchronize the two nodes. However I get LOTS of these messages (I'd say every two minutes on average). So I also disabled offloading on the network card (on both nodes) : # ethtool -k eth1 Offload parameters for eth1: rx-checksumming: on tx-checksumming: on scatter-gather: on tcp segmentation offload: on udp fragmentation offload: off generic segmentation offload: off # ethtool -K eth1 rx off tx off # ethtool -k eth1 Offload parameters for eth1: rx-checksumming: off tx-checksumming: off scatter-gather: off tcp segmentation offload: off udp fragmentation offload: off generic segmentation offload: off But I still get these messages ! I had updated the kernel to version : "linux-image-2.6.25-2-686-bigmem 2.6.25-6~bpo40+1" on 2008.07.09 and haven't _noticed_ any corruption since then, even though DRBD seems to find lots and lots of errors. I really get the impression these are false positives... but should I worry ? What could I do now ? Thanks for any suggestion ! Eric