[DRBD-user] Strange number of 'Digest integrity check FAILED' after starting one agent

loopx laurent.henssen at gmail.com
Thu May 26 17:06:39 CEST 2011


Hello, 


We are running RHEL5 with DRBD version 8.3.8 (api:88/proto:86-94) from the
CentOS Extras repository (not updated since the first install). Two
servers are used:
- Server_A (primary)
- Server_B (secondary)


Server_B is idle, while Server_A runs the applications (on the "primary"
side, which is the one that accesses DRBD). Because I'm using "Atlassian
Bamboo" with a commercial licence (which includes 1 free remote agent), I
want to run the Bamboo Remote Agent on Server_B (it does not need access to
DRBD) to get better compilation performance/profitability.


So, everything "was" fine before starting the Bamboo Remote Agent ... I say
"was" because I know the error "Digest integrity check FAILED" already
occurred from time to time.

Before the Remote Agent :
--------------------
[1][root@Server_B ~]$ zcat /var/log/messages.4.gz /var/log/messages.3.gz | grep FAILED
Apr 26 20:00:00 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 04:02:42 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 13:31:04 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 13:51:01 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 14:45:52 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 14:45:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 16:07:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 28 07:56:38 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 28 10:02:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 28 14:21:08 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 15:26:55 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 15:53:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 15:53:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 20:00:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:02:12 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:03:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:03:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:03:33 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:03:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:03:44 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:03:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:03:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 04:04:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 20:24:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  1 20:40:47 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:02:12 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:03:08 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:03:18 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:03:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:03:28 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:03:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:03:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:04:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:04:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:06:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:07:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:07:08 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:07:18 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:07:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:07:33 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:07:38 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:07:48 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:08:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:08:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:08:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:08:28 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:08:38 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:08:48 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:08:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:09:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:09:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:09:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 04:11:58 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 09:38:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  2 15:48:56 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 04:02:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 04:02:17 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 04:02:32 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 04:02:42 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 07:19:01 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 07:56:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 10:09:36 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 14:45:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 15:40:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  3 21:53:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  4 04:02:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  4 04:02:11 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  4 04:04:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  4 04:04:21 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  4 11:54:00 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  4 11:54:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  4 20:07:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  4 20:08:51 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  4 20:09:56 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  5 21:34:55 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  5 22:00:47 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  5 22:01:31 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  5 22:09:46 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  5 22:11:40 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  6 09:49:26 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  6 12:07:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  6 22:04:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:02:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:02:27 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:02:32 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:03:47 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:03:52 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:03:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:04:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:04:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:04:12 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:04:15 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:04:22 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:04:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:04:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 04:05:27 Server_B kernel: block drbd0: Digest integrity check FAILED.
May  7 22:13:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
--------------------

That is not a lot of errors as far as I'm concerned (but it could be better).
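
To see whether the failures cluster at particular times rather than being spread evenly, the grep output above can be bucketed per day and hour. This is only a minimal sketch: it assumes Python 3 on whatever machine holds a copy of the logs, and the name `failures_per_hour` is made up for illustration.

```python
import re
from collections import Counter

# Matches syslog lines like the ones quoted above, e.g.
# "May  1 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED."
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):\d{2}:\d{2}\s+\S+\s+kernel: block drbd\d+: "
    r"Digest integrity check FAILED"
)


def failures_per_hour(lines):
    """Count failures per (month, day, hour) bucket; other lines are ignored."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m["month"], int(m["day"]), int(m["hour"]))] += 1
    return counts
```

Feeding it the lines produced by the zcat command and printing `failures_per_hour(lines).most_common(10)` lists the busiest hours first.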


So, right now, here is our configuration:
-----------------------------------------
Server_A =bond0[eth0,eth1]=> Gigabit link => Server_B =bond0[eth0, eth1]
-----------------------------------------

Both servers use the "bonding" module in "active-backup" mode (one
interface UP, the other in hot standby). There are NO errors at all from the
"bonding" module: no repeated failovers from one NIC to the other, etc. All
is fine.


Here is my DRBD configuration file (the same on both sides, of course):
----------------------------
global {
        #don't send statistics through internet ...
        usage-count no;
}

common {
        protocol C;

        syncer {
                #replication speed
                rate 20M;

                #use compression for bitmap exchange
                use-rle;

                #set the "on-line device verification" algorithm (should be triggered by a cronjob)
                verify-alg sha1;

                #set the "checksum-based synchronization" algorithm (used when synchronizing)
                csums-alg crc32c;

                #tuning the activity log size
                #default is 127 ; increase it when using intensive I/O (writing lots of small files)
                al-extents 3389;
        }
}

resource data-integration {
        device /dev/drbd0;
        meta-disk internal;

        handlers {
                #send mail for these events
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh root";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
                pri-lost "/usr/lib/drbd/notify-pri-lost.sh root";
                local-io-error "/usr/lib/drbd/notify-io-error.sh root";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";

                #no notification script for these handlers or don't want to work ...
                #before-resync-target "/usr/lib/drbd/";
                #after-resync-target "/usr/lib/drbd/";
                #initial-split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        }

        disk {
                #use "none" for write-after-write (because, we've got battery for our server)
                no-disk-barrier;
                no-disk-flushes;
                no-md-flushes;
                #should NOT be used ? only under special circumstances ? But, we don't need to write in order ... so disable it
                no-disk-drain;

                #on I/O error, detach disk and use the remote peer disk
                on-io-error detach;
        }

        net {
                #authentication
                cram-hmac-alg sha1;
                shared-secret "<tralalilaleeere>";

                #2x primary to use GFS ("rw/ro" or "rw/rw")
                #this is not needed if you want to use "ext3/ext4" filesystem ("rw/--" only)
                ##allow-two-primaries;

                #set the "replication traffic integrity checking" algorithm (used when replicating)
                data-integrity-alg md5;

                #split-brain (node = secondary/primary/both-primary ; discard if no change/disconnect/disconnect)
                #do nothing => disconnect
                after-sb-0pri discard-zero-changes;
                after-sb-1pri disconnect;
                after-sb-2pri disconnect;

                #tuning recommendations (for RAID controller)
                max-buffers 8000;
                max-epoch-size 8000;
        }

        startup {
                #don't wait indefinitely (boot gets stuck if not set)
                wfc-timeout 15;
                degr-wfc-timeout 15;

                #when starting DRBD service, set one node to "primary" (<node_name> or "both")
                become-primary-on <Server_A>.<domain>;
        }

        on <Server_A>.<domain> {
                address         <ip_Server_A>:7788;
                disk            /dev/sda5;
        }
        on <Server_B>.<domain> {
                address         <ip_Server_B>:7788;
                disk            /dev/sda5;
        }
}
----------------------------

Server_A and Server_B are both IBM servers with RAID + battery ... Maybe
some of the values are wrong. The filesystem on DRBD is ext3 and I don't know
whether these values are OK:
-----------
                #use "none" for write-after-write (because, we've got battery for our server)
                no-disk-barrier;
                no-disk-flushes;
                no-md-flushes;
                #should NOT be used ? only under special circumstances ? But, we don't need to write in order ... so disable it
                no-disk-drain;
-----------


I've read several posts and, to try to fix the "integrity check" errors, I
have disabled all offloading on "bond0", "eth0" and "eth1" on both
servers ("ethtool -k <interface>" shows everything set to off).
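
For re-checking this after a reboot or a driver reload, here is a rough sketch that lists whatever "ethtool -k" still reports as "on". It assumes Python 3.7 or later and the usual "name: on/off" output layout; the function names `enabled_offloads` and `check_interfaces` are made up for illustration, and the interface names are the ones mentioned above.

```python
import subprocess


def enabled_offloads(ethtool_k_output):
    """Return the feature names that `ethtool -k` output reports as "on"."""
    enabled = []
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # no colon on this line, nothing to parse
        state = value.split()
        # The header line ends with a colon and has no value, so it is skipped.
        if state and state[0] == "on":
            enabled.append(name.strip())
    return enabled


def check_interfaces(interfaces=("bond0", "eth0", "eth1")):
    """Run `ethtool -k` per interface and map it to its enabled offloads."""
    report = {}
    for iface in interfaces:
        result = subprocess.run(
            ["ethtool", "-k", iface],
            capture_output=True, text=True, check=True,
        )
        report[iface] = enabled_offloads(result.stdout)
    return report
```

An empty list for every interface in `check_interfaces()` would match the "all off" state described above.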


So, what's the problem?
When I start the Bamboo Remote Agent, I get several integrity check
errors per minute. Since I disabled offloading on all NICs,
the number of errors has decreased, but they are still there:
--------------------
May 26 12:14:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:16:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:26:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:26:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:28:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:29:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:29:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:33:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:42:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:45:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:45:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:46:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:47:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:50:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:04 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:44 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:52:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:52:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:54:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:59:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:09:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:10:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:11:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:12:04 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:12:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:14:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:16:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:16:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:16:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:17:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:17:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:18:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:18:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:21:44 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:22:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:22:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:26:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:27:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
--------------------
(for now, I can't do better ...)

Sometimes there is a gap of several minutes before the next error.
Sometimes not ...

I keep wondering what can trigger these errors, and so many of them, even
though I have already read many threads. If I stop the Remote Agent on
Server_B, the problem mostly "disappears" (still there, but less frequent).
I will stop it now (have to eat) and continue this message afterwards ...

ZZzzZZzzzZZz

OK, I started the Remote Agent again on Server_B, waited a few minutes and
then ...
-----------------
May 26 13:22:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:26:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:27:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:38:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:40:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
[stop & eat time ...]

[start the Bamboo Remote Agent]
May 26 14:39:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 14:44:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
-----------------

As I already said, the number of errors has decreased since I disabled
offloading ... but why are there still so many errors?
I tried to load both Server_A and Server_B by sending bidirectional network
traffic at the maximum bandwidth (40Mb/s .. have reached 10Gb) ...
but I never reproduced the strange behavior that simply starting the
Bamboo Remote Agent causes.

I'm now using Wireshark to sniff packets. I found the Remote Agent's
traffic, but it is minimal and doesn't help me. Every 5 seconds, the Remote
Agent sends a heartbeat from Server_B to Server_A (TCP), always using
the same source port on Server_B to reach the same destination port on
Server_A (:54663):
--------------------
(heartbeat content)
......!Al.....k.$.%....n.$............0+..e...4...0<com.atlassian.bamboo.v2.build.agent.messages.UpdateHeartbeatMessage>

  <agentId>32541146</agentId>

  <status class="com.atlassian.bamboo.v2.build.agent.AgentIdleStatus"/>

  <systemInfo>

    <userName>root</userName>

    <userTimezone>Europe/Brussels</userTimezone>

    <userLocale>English (United States)</userLocale>

    <systemEncoding>UTF-8</systemEncoding>

    <operatingSystem>Linux 2.6.18-194.11.3.el5</operatingSystem>

    <operatingSystemArchitecture>amd64</operatingSystemArchitecture>

    <systemDate>Thursday, 26 May 2011</systemDate>

    <systemTime>11:04:59</systemTime>

    <tempDir>/tmp</tempDir>

    <totalMemory>256</totalMemory>

    <freeMemory>227</freeMemory>

    <usedMemory>28</usedMemory>

    <availableProcessors>4</availableProcessors>

    <startupTimestamp>1301163182237</startupTimestamp>

    <currentDirectory>/mnt/data1/bamboo_agent_home/bin</currentDirectory>

    <applicationHome>/mnt/data1/bamboo_agent_home</applicationHome>

   
<buildWorkingDirectory>/mnt/data1/bamboo_agent_home/xml-data/build-dir</buildWorkingDirectory>

    <freeDiskSpace>55 GB</freeDiskSpace>

    <currentDate>2011-05-26 11:04:59.474 CEST</currentDate>

    <hostName><Server_B>.<domain></hostName>

    <ipAddress><ip_Server_B></ipAddress>

  </systemInfo>

</com.atlassian.bamboo.v2.build.agent.messages.UpdateHeartbeatMessage>...(......fingerprint...-1542188871550963522....................g......!Al.....l.$.%....z.$............0
--------------------


Don't tell me that this little "heartbeat" can cause so many problems with
DRBD ...
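
Since the heartbeat goes out every 5 seconds, one cheap check is whether the failure timestamps fall on a fixed position inside a 5-second period. The sketch below is only an idea for exploring the logs, not a diagnosis: syslog timestamps have one-second resolution, and an alignment would not prove cause. It assumes Python 3, and the name `second_phase_histogram` is made up for illustration.

```python
import re
from collections import Counter

# First HH:MM:SS found on the line; in the syslog lines above that is the timestamp.
TIME_RE = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\b")


def second_phase_histogram(lines, period=5):
    """Histogram of (seconds modulo period) over the digest-failure lines."""
    hist = Counter()
    for line in lines:
        if "Digest integrity check FAILED" not in line:
            continue
        m = TIME_RE.search(line)
        if m:
            hist[int(m.group(3)) % period] += 1
    return hist
```

If the failures were unrelated to any 5-second timer, the five buckets would be expected to fill roughly evenly over a long enough log.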

I have one question about the DRBD disconnect/reconnect process when an
"integrity check" error is triggered. On reconnect, will the data that was
not yet synced be correctly resynchronized? Or should I run an online verify
after every disconnection (which is, of course, impossible for us with 200Gb
and several errors per hour)?
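
While waiting for an answer, one thing that can at least be watched is the connection state and the "oos" (out-of-sync, in KiB) counter that /proc/drbd prints per device. The sketch below does not settle whether the resync after a reconnect is complete; it only reads those two fields. It assumes Python 3 and the field layout printed by DRBD 8.3.x; the name `parse_proc_drbd` is made up for illustration.

```python
import re

# A device line looks like " 0: cs:Connected ro:Secondary/Primary ds:..."
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")
# The counters line that follows a device line ends with "... oos:0"
OOS_RE = re.compile(r"\boos:(\d+)")


def parse_proc_drbd(text):
    """Map each device minor number to its connection state and oos counter."""
    devices = {}
    current = None
    for line in text.splitlines():
        dev = DEVICE_RE.match(line)
        if dev:
            current = int(dev.group(1))
            devices[current] = {"cs": dev.group(2), "oos_kib": None}
            continue
        oos = OOS_RE.search(line)
        if oos and current is not None:
            devices[current]["oos_kib"] = int(oos.group(1))
    return devices
```

For example, `parse_proc_drbd(open("/proc/drbd").read())` run shortly after a reconnect shows whether device 0 is back to "Connected" and what its oos counter reads.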

I need some good new ideas because right now I'm stuck ...

F1 F1 :)
-- 
View this message in context: http://old.nabble.com/Strange-number-of-%27Digest-integrity-check-FAILED%27-after-starting-one-agent-tp31707977p31707977.html
Sent from the DRBD - User mailing list archive at Nabble.com.



