[DRBD-user] Digest integrity check failed

Lars Ellenberg lars.ellenberg at linbit.com
Wed Nov 16 13:12:53 CET 2011



On Tue, Nov 15, 2011 at 10:36:22PM -1100, Nick Morrison wrote:
> 
> All,
> 
> Further to my previous messages, and just in case someone's solutioned this already...
> 
> Having disabled checksum offloading on both servers, swapped the
> cable, and swapped NICs (same set of 4 onboard NICs, but at least a
> different port) I am unfortunately still getting the below messages in
> my logs fairly regularly.  I understand that the digest integrity
> check is a pretty basic thing and unlikely to be a bug in DRBD or the
> kernel md5 code.  Maybe a hardware fault.  Maybe a bnx2 driver bug.

Maybe simply _buffers modified in flight_.
Some access patterns just do that, and "stable pages" (file systems
trying harder to prevent that from happening) are not yet in your kernel.

More recent DRBD will log an additional message,
 "Digest mismatch, buffer modified by upper layers during write:"
(on the sending, Primary side) if it detects that this has happened.
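
To illustrate what "modified in flight" means for a digest check, here is a
minimal Python sketch. It is not DRBD code: the function names, the 4096-byte
buffer and the way the modification is simulated are all made up for the
example, and it only assumes that the sender can compare a digest taken
before sending with one taken afterwards.

```python
import hashlib

def send_with_digest(buffer: bytearray, modify_in_flight: bool):
    """Hypothetical sender: returns (digest sent, payload sent, sender noticed change)."""
    digest_before = hashlib.md5(buffer).digest()   # digest computed up front
    if modify_in_flight:
        buffer[0] ^= 0xFF                          # upper layer rewrites the buffer
    payload = bytes(buffer)                        # what actually goes on the wire
    digest_after = hashlib.md5(buffer).digest()    # sender re-checks after sending
    return digest_before, payload, digest_before != digest_after

def receiver_accepts(digest: bytes, payload: bytes) -> bool:
    """Hypothetical receiver: recompute the digest over the received payload."""
    return hashlib.md5(payload).digest() == digest

if __name__ == "__main__":
    for modified in (False, True):
        digest, payload, noticed = send_with_digest(bytearray(b"x" * 4096), modified)
        print(modified, receiver_accepts(digest, payload), noticed)
```

In the modified case the receiver rejects the data even though NIC, cable and
driver did nothing wrong, and the sender can tell that the buffer changed
underneath it.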

Upgrade to a recent DRBD 8.3 release,
and if every digest-mismatch error is accompanied by the "buffer
modified" message as well, simply disable the digest integrity check.

We generally recommend disabling it in production anyway.
It is more a diagnostic tool than a protection feature.
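
A rough way to check whether the failures are all accompanied by the "buffer
modified" message is to count both messages per device in the kernel logs of
the two nodes. The Python sketch below does that. The function names are
hypothetical, and it assumes the sender-side message carries the same
"block drbdN:" prefix as the receiver-side lines quoted below. It compares
counts only, because the kernel timestamps of the two hosts are not
comparable.

```python
import re
from collections import Counter

# Message texts as quoted in this thread; the common "block drbdN:" prefix
# on both messages is an assumption based on the log excerpts below.
FAILED = "Digest integrity check FAILED"
MODIFIED = "Digest mismatch, buffer modified by upper layers during write"
DEVICE = re.compile(r"block (drbd\d+): (.*)")

def count_by_device(lines, needle):
    """Count log lines per DRBD device whose message contains `needle`."""
    counts = Counter()
    for line in lines:
        match = DEVICE.search(line)
        if match and needle in match.group(2):
            counts[match.group(1)] += 1
    return counts

def unexplained_failures(secondary_lines, primary_lines):
    """Per device, the number of receiver-side failures that have no
    matching "buffer modified" message on the sending side."""
    failed = count_by_device(secondary_lines, FAILED)
    modified = count_by_device(primary_lines, MODIFIED)
    return {dev: n - modified[dev] for dev, n in failed.items() if n > modified[dev]}

if __name__ == "__main__":
    secondary = ["[96851.637325] block drbd5: Digest integrity check FAILED."]
    # Hypothetical sender-side line; the text after the colon is made up.
    primary = ["[94974.600000] block drbd5: " + MODIFIED + ": (details)"]
    print(unexplained_failures(secondary, primary))
```

An empty result suggests that every failure coincides with a modified buffer.
Anything left over is worth a closer look at the hardware or the driver.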

> If I can't pinpoint the problem, I'll get a non-broadcom PCI NIC for each machine and try again.
> 
> Some gory details follow:
> 
> The servers are (new) Dell R710s, with Broadcom NICs.  I'm running
> ubuntu 10.04.3 LTS, Linux kvm-host-02 2.6.32-35-server #78-Ubuntu SMP
> Tue Oct 11 16:26:12 UTC 2011 x86_64 GNU/Linux
> 
> drbd: Version: 8.3.7 (api:88)
> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by buildd at yellow, 2011-07-22 17:37:28
> 
> eth2 is connected directly to eth2 via a cat5e cable.
> 
> 
> $ sudo ethtool -i eth2
> driver: bnx2
> version: 2.0.2
> firmware-version: 5.2.3 NCSI 2.0.11
> bus-info: 0000:02:00.0
> $ 
> 
> 
> eth2      Link encap:Ethernet  HWaddr bc:30:5b:e2:1a:cd  
>           inet addr:172.16.1.2  Bcast:172.16.1.255  Mask:255.255.255.0
>           inet6 addr: fe80::be30:5bff:fee2:1acd/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:45777422 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:51952054 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000 
>           RX bytes:29580717904 (29.5 GB)  TX bytes:52375617683 (52.3 GB)
>           Interrupt:32 Memory:da000000-da012800 
> 
> 
> On kvm-host-02:
> 
> [96851.637325] block drbd5: Digest integrity check FAILED.
> [96851.642704] block drbd5: error receiving Data, l: 4136!
> [96851.647918] block drbd5: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) 
> [96851.648121] block drbd5: asender terminated
> [96851.648126] block drbd5: Terminating asender thread
> [96851.648397] block drbd5: Connection closed
> [96851.648403] block drbd5: conn( ProtocolError -> Unconnected ) 
> [96851.648410] block drbd5: receiver terminated
> [96851.648412] block drbd5: Restarting receiver thread
> [96851.648415] block drbd5: receiver (re)started
> [96851.648419] block drbd5: conn( Unconnected -> WFConnection ) 
> [96852.100540] block drbd5: Handshake successful: Agreed network protocol version 91
> [96852.100961] block drbd5: Peer authenticated using 20 bytes of 'sha1' HMAC
> [96852.100970] block drbd5: conn( WFConnection -> WFReportParams ) 
> [96852.101158] block drbd5: Starting asender thread (from drbd5_receiver [2580])
> [96852.204023] block drbd5: data-integrity-alg: md5
> [96852.204054] block drbd5: drbd_sync_handshake:
> [96852.204059] block drbd5: self 05FED1060E9FE8BC:0000000000000000:1164DBBF12EF2784:E2DC4613A0140D1D bits:0 flags:0
> [96852.204063] block drbd5: peer 76B564F2F06575E9:05FED1060E9FE8BD:1164DBBF12EF2784:E2DC4613A0140D1D bits:12 flags:0
> [96852.204067] block drbd5: uuid_compare()=-1 by rule 50
> [96852.204073] block drbd5: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
> [96852.276554] block drbd5: conn( WFBitMapT -> WFSyncUUID ) 
> [96852.277357] block drbd5: helper command: /sbin/drbdadm before-resync-target minor-5
> [96852.281317] block drbd5: helper command: /sbin/drbdadm before-resync-target minor-5 exit code 0 (0x0)
> [96852.281324] block drbd5: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent ) 
> [96852.281331] block drbd5: Began resync as SyncTarget (will sync 48 KB [12 bits set]).
> [96852.321078] block drbd5: Resync done (total 1 sec; paused 0 sec; 48 K/sec)
> [96852.321089] block drbd5: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
> [96852.321097] block drbd5: helper command: /sbin/drbdadm after-resync-target minor-5
> [96852.376033] block drbd5: helper command: /sbin/drbdadm after-resync-target minor-5 exit code 1 (0x100)
> 
> 
> on kvm-host-01 (the fence-peer error messages are because this particular resource isn't managed by pacemaker yet):
> 
> [94974.614638] block drbd5: sock was shut down by peer
> [94974.614656] block drbd5: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
> [94974.614664] block drbd5: short read expecting header on sock: r=0
> [94974.614752] block drbd5: Creating new current UUID
> [94974.614763] block drbd5: sock_sendmsg returned -32
> [94974.614766] block drbd5: short sent ReportUUIDs size=56 sent=0
> [94974.614841] block drbd5: meta connection shut down by peer.
> [94974.614845] block drbd5: asender terminated
> [94974.614847] block drbd5: Terminating asender thread
> [94974.680561] block drbd5: Connection closed
> [94974.680569] block drbd5: helper command: /sbin/drbdadm fence-peer minor-5
> [94974.737763] block drbd5: helper command: /sbin/drbdadm fence-peer minor-5 exit code 1 (0x100)
> [94974.737767] block drbd5: fence-peer helper broken, returned 1
> [94974.753807] block drbd5: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
> [94974.786931] block drbd5:  old = { cs:BrokenPipe ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
> [94974.820006] block drbd5:  new = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
> [94974.854860] block drbd5: conn( BrokenPipe -> Unconnected ) 
> [94974.854882] block drbd5: receiver terminated
> [94974.854885] block drbd5: Restarting receiver thread
> [94974.854887] block drbd5: receiver (re)started
> [94974.854895] block drbd5: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
> [94974.889697] block drbd5:  old = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
> [94974.924684] block drbd5:  new = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
> [94974.959571] block drbd5: conn( Unconnected -> WFConnection ) 
> [94975.067132] block drbd5: Handshake successful: Agreed network protocol version 91
> [94975.067473] block drbd5: Peer authenticated using 20 bytes of 'sha1' HMAC
> [94975.067481] block drbd5: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
> [94975.102822] block drbd5:  old = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
> [94975.136769] block drbd5:  new = { cs:WFReportParams ro:Primary/Unknown ds:UpToDate/DUnknown r--- }
> [94975.170071] block drbd5: conn( WFConnection -> WFReportParams ) 
> [94975.170212] block drbd5: Starting asender thread (from drbd5_receiver [2661])
> [94975.170320] block drbd5: data-integrity-alg: md5
> [94975.170334] block drbd5: drbd_sync_handshake:
> [94975.170339] block drbd5: self 76B564F2F06575E9:05FED1060E9FE8BD:1164DBBF12EF2784:E2DC4613A0140D1D bits:12 flags:0
> [94975.170343] block drbd5: peer 05FED1060E9FE8BC:0000000000000000:1164DBBF12EF2784:E2DC4613A0140D1D bits:0 flags:0
> [94975.170346] block drbd5: uuid_compare()=1 by rule 70
> [94975.170352] block drbd5: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) 
> [94975.243704] block drbd5: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent ) 
> [94975.243715] block drbd5: Began resync as SyncSource (will sync 48 KB [12 bits set]).
> [94975.287465] block drbd5: Resync done (total 1 sec; paused 0 sec; 48 K/sec)
> [94975.287474] block drbd5: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
> nickm at kvm-host-01:~$ 
> 
> 
> resource sc1.samoa.ws {
> 	protocol	C;
> 
> 	handlers {
> 		pri-on-incon-degr	"/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> 		pri-lost-after-sb	"/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> 		local-io-error	"/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
> 		after-resync-target	/usr/lib/drbd/crm-unfence-peer.sh;
> 	}
> 
> 	net {
> 		cram-hmac-alg	sha1;
> 		shared-secret	d4cfba2dcct279e22ff70d37eed6f7c1;
> 		after-sb-0pri	discard-zero-changes;
> 		after-sb-1pri	discard-secondary;
> 		data-integrity-alg	md5;
> 	}
> 
> 	on kvm-host-01 {
> 		device		/dev/drbd5;
> 		disk		/dev/kvm-host-01/sc1.samoa.ws;
> 		flexible-meta-disk	internal;
> 		address		172.16.1.1:7793;
> 	}
> 	on kvm-host-02 {
> 		device		/dev/drbd5;
> 		disk		/dev/kvm-host-02/sc1.samoa.ws;
> 		flexible-meta-disk	internal;
> 		address		172.16.1.2:7793;
> 	}
> }
> 
> 
> On 3 Nov 2011, at 13:36, Nick Morrison wrote:
> 
> >>> I'm now finding myself a cable to connect these hosts directly.
> >> 
> >> That may or may not solve your issue; the original motivation for the
> >> data integrity feature was to catch issues with NICs, not cables.
> > 
> > Used a new cable, directly connected the hosts, and disabled checksum
> > offloading on the NICs too.  For posterity, here's what I did:
> > 
> > ethtool -k eth3
> > 
> > .. to check the current status, and
> > 
> > ethtool -K eth3 rx off
> > ethtool -K eth3 tx off
> > ethtool -K eth3 gso off
> > 
> > .. to disable offloading.
> > 
> > A couple of minutes later, I got the error again - block drbd3: Digest integrity
> > check FAILED.
> 

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com


