[DRBD-user] Digest integrity check FAILED - Help tracking down the cause

Lars Ellenberg lars.ellenberg at linbit.com
Fri Sep 13 13:52:34 CEST 2013

On Fri, Sep 13, 2013 at 10:47:17AM +0200, Martin Reissner wrote:
> Hello,
> 
> for some days now I've been getting these errors in the log every
> couple hours and I have a hard time figuring out where they come from.
> I know this is most likely not a DRBD issue as the setup has been
> running without problems for months and nothing has been changed. I
> don't know what else to try though, can someone on here maybe point me
> in the right direction?
> 
> I have a simple active/passive setup running MySQL on Debian 6.0.7
> (Squeeze); the DRBD version is 8.3.7.
> 
> We tried running a manual online verify, but each time it was aborted by
> the disconnect caused by the "Digest integrity check FAILED" error. Finally
> I disabled the "data-integrity-alg" option, and then the verify completed
> without any errors.
> 
> I've had the hardware (RAM, CPU, disks) checked on both nodes, to no avail,
> and I also replaced the NICs for the direct crosslink that is used by DRBD.
> 
> Below are the corresponding logs from mdb1-ha1 and mdb1-ha2; I will
> gladly provide further info if needed. FWIW, the setup is still running
> live without any issues, and unless I turn on "data-integrity-alg"
> the logs stay clean.

Do these threads help?

http://thread.gmane.org/gmane.linux.network.drbd/21223/focus=21391
http://thread.gmane.org/gmane.linux.network.drbd/22836/focus=22897
http://thread.gmane.org/gmane.linux.network.drbd/19409/focus=19426

And more ...

tl;dr:

*maybe* you have hardware problems.

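*likely* you just have the "normal" behaviour of an application/kernel
that is "misbehaving" from the point of view of the storage subsystem:
it modifies buffers while the write is still in flight, so the digest
calculated before sending no longer matches what arrives on the peer.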
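
To illustrate how a buffer that is changed while still in flight trips
the check, here is a toy model in Python. It is only a sketch of the
idea, not DRBD code: send_with_digest() and receiver_accepts() are
made-up names, and the single flipped byte stands in for whatever the
upper layer does to the page between digest calculation and transmit.

```python
import hashlib

def send_with_digest(buf, modify_in_flight):
    """Toy sender: digest the buffer first, put it on the wire afterwards."""
    digest = hashlib.sha1(buf).digest()   # digest taken over the current contents
    if modify_in_flight:
        buf[0] ^= 0xFF                    # upper layer rewrites the page meanwhile
    payload = bytes(buf)                  # this is what actually gets transmitted
    return digest, payload

def receiver_accepts(digest, payload):
    """Toy receiver: recompute the digest over what arrived and compare."""
    return hashlib.sha1(payload).digest() == digest

stable = receiver_accepts(*send_with_digest(bytearray(4096), False))
racy = receiver_accepts(*send_with_digest(bytearray(4096), True))
print("untouched buffer accepted:", stable)
print("modified buffer accepted: ", racy)
```

Nothing is wrong with the link or the disks in this model, yet the
second transfer is rejected. That is why the check alone does not prove
a hardware problem.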

Upgrading the kernel may help. Or not.

We should rename "data integrity" to
"calculate and double check message digests for diagnostic purposes
 and burn cpu as a side effect".
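
If you want to use the check as the diagnostic it is, it may help to see
when and how often it fires, and correlate that with load or cron jobs.
A minimal Python sketch, assuming syslog-style kernel lines with one
message per line, in the same format as your logs.
count_digest_failures() is a made-up helper name; in practice you would
feed it the lines of your kernel log file instead of the sample list.

```python
import re
from collections import Counter

# Matches e.g.:
# Sep  5 07:49:10 mdb1-ha1 kernel: [68271.012133] block drbd1: Digest integrity check FAILED.
PATTERN = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+) (?P<hour>\d\d):\d\d:\d\d "
    r"(?P<host>\S+) kernel: \[\s*[\d.]+\] block (?P<dev>drbd\d+): "
    r"Digest integrity check FAILED"
)

def count_digest_failures(lines):
    """Count digest failures per (host, device, hour of day)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            hour = "%s %s %s:00" % (m.group("month"), m.group("day"), m.group("hour"))
            counts[(m.group("host"), m.group("dev"), hour)] += 1
    return counts

sample = [
    "Sep  5 07:49:10 mdb1-ha1 kernel: [68271.012133] block drbd1: Digest integrity check FAILED.",
    "Sep  5 07:49:10 mdb1-ha1 kernel: [68271.012167] block drbd1: error receiving Data, l: 4140!",
]
for key, n in sorted(count_digest_failures(sample).items()):
    print(key, n)
```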

	Lars

> 
> Martin
> 
> ha1:
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.012133] block drbd1: Digest integrity check FAILED.
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.012167] block drbd1: error receiving Data, l: 4140!
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.012197] block drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.012212] block drbd1: asender terminated
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.012215] block drbd1: Terminating drbd1_asender
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.013179] block drbd1: Connection closed
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.013182] block drbd1: conn( ProtocolError -> Unconnected )
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.013185] block drbd1: receiver terminated
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.013186] block drbd1: Restarting drbd1_receiver
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.013188] block drbd1: receiver (re)started
> Sep  5 07:49:10 mdb1-ha1 kernel: [68271.013191] block drbd1: conn( Unconnected -> WFConnection )
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.177560] block drbd1: Handshake successful: Agreed network protocol version 91
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.177566] block drbd1: conn( WFConnection -> WFReportParams )
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.177582] block drbd1: Starting asender thread (from drbd1_receiver [2032])
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.177689] block drbd1: data-integrity-alg: sha1
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.177753] block drbd1: drbd_sync_handshake:
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.177757] block drbd1: self 095ABE2754A6CE94:0000000000000000:F0420ACD09464C04:704D31CBB5F812AF bits:0 flags:0
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.177761] block drbd1: peer 90C3D267D663925D:095ABE2754A6CE95:F0420ACD09464C05:704D31CBB5F812AF bits:61 flags:0
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.177765] block drbd1: uuid_compare()=-1 by rule 50
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.177770] block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.442588] block drbd1: conn( WFBitMapT -> WFSyncUUID )
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.445292] block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.446469] block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.446472] block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.446476] block drbd1: Began resync as SyncTarget (will sync 244 KB [61 bits set]).
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.533948] block drbd1: Resync done (total 1 sec; paused 0 sec; 244 K/sec)
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.533957] block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.533964] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
> Sep  5 07:49:11 mdb1-ha1 kernel: [68272.554497] block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 0 (0x0)
> 
> ha2:
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587564] block drbd1: sock was shut down by peer
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587567] block drbd1: meta connection shut down by peer.
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587572] block drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587580] block drbd1: asender terminated
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587581] block drbd1: Terminating drbd1_asender
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587584] block drbd1: Creating new current UUID
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587593] block drbd1: sock_sendmsg returned -32
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587595] block drbd1: short sent ReportUUIDs size=56 sent=0
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587648] block drbd1: short read expecting header on sock: r=0
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587837] block drbd1: Connection closed
> Sep  5 07:49:10 mdb1-ha2 kernel: [32102358.587841] block drbd1: helper command: /sbin/drbdadm fence-peer minor-1
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.650659] block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 4 (0x400)
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.650662] block drbd1: fence-peer helper returned 4 (peer was fenced)
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.650667] block drbd1: pdsk( DUnknown -> Outdated )
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.655320] block drbd1: conn( NetworkFailure -> Unconnected )
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.655326] block drbd1: receiver terminated
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.655327] block drbd1: Restarting drbd1_receiver
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.655329] block drbd1: receiver (re)started
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.655333] block drbd1: conn( Unconnected -> WFConnection )
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.752623] block drbd1: Handshake successful: Agreed network protocol version 91
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.752630] block drbd1: conn( WFConnection -> WFReportParams )
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.752644] block drbd1: Starting asender thread (from drbd1_receiver [1758])
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.752696] block drbd1: data-integrity-alg: sha1
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.757933] block drbd1: drbd_sync_handshake:
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.757937] block drbd1: self 90C3D267D663925D:095ABE2754A6CE95:F0420ACD09464C05:704D31CBB5F812AF bits:61 flags:0
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.757940] block drbd1: peer 095ABE2754A6CE94:0000000000000000:F0420ACD09464C04:704D31CBB5F812AF bits:0 flags:0
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.757942] block drbd1: uuid_compare()=1 by rule 70
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102359.757947] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> UpToDate )
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102360.020204] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102360.020212] block drbd1: Began resync as SyncSource (will sync 244 KB [61 bits set]).
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102360.109042] block drbd1: Resync done (total 1 sec; paused 0 sec; 244 K/sec)
> Sep  5 07:49:11 mdb1-ha2 kernel: [32102360.109047] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> 
> -- 
> Wavecon GmbH | Ludwigstraße 2 | 90763 Fuerth
> HR/HRN: 10780 | GF: Cemil Degirmenci
> Ust-ID: DE251398082| Fon +49 911 120 6581
> Fax: +49 911 212 923 3 | Web: wavecon.de
> Mail + Jabber: mreissner at wavecon.de
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


