Hello,
We are running RHEL5 with DRBD version 8.3.8 (api:88/proto:86-94) from the
CentOS Extras repository (not updated since the first install). Two
servers are used:
- Server_A (primary)
- Server_B (secondary)
Server_B is "inactive" compared to Server_A, which runs the applications (on
the "primary" side, to access DRBD). However, because I'm using "Atlassian
Bamboo" with a commercial licence (which includes 1 free remote agent), I want
to run the Bamboo Remote Agent on Server_B (it has no need to access DRBD) to
get better compilation performance/profitability.
So, all "was" fine before starting the "Bamboo Remote Agent" ... "was"
because I know that the error "Digest integrity check FAILED" occurs sometimes.
Before the Remote Agent:
--------------------
[root@Server_B ~]$ zcat /var/log/messages.4.gz /var/log/messages.3.gz | grep FAILED
Apr 26 20:00:00 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 04:02:42 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 13:31:04 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 13:51:01 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 14:45:52 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 14:45:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 16:07:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 28 07:56:38 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 28 10:02:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 28 14:21:08 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 15:26:55 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 15:53:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 15:53:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 20:00:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:02:12 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:33 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:44 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:04:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 20:24:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 20:40:47 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:02:12 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:08 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:18 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:28 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:04:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:04:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:06:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:08 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:18 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:33 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:38 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:48 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:28 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:38 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:48 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:09:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:09:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:09:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:11:58 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 09:38:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 15:48:56 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 04:02:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 04:02:17 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 04:02:32 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 04:02:42 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 07:19:01 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 07:56:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 10:09:36 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 14:45:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 15:40:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 21:53:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 04:02:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 04:02:11 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 04:04:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 04:04:21 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 11:54:00 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 11:54:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 20:07:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 20:08:51 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 20:09:56 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 21:34:55 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 22:00:47 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 22:01:31 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 22:09:46 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 22:11:40 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 6 09:49:26 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 6 12:07:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 6 22:04:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:02:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:02:27 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:02:32 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:03:47 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:03:52 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:03:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:12 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:15 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:22 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:05:27 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 22:13:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
--------------------
That is not very many errors for me (but it could be better).
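To see whether the failures cluster at particular hours of the day, the grep above can be extended with a small per-hour count. This is only a rough sketch: the function name count_failed_by_hour is made up for illustration, and it assumes the standard syslog layout shown above, where the third field is HH:MM:SS.

```shell
# Count "Digest integrity check FAILED" lines per hour of day.
# Reads syslog-formatted lines on stdin; assumes field 3 is HH:MM:SS.
# (count_failed_by_hour is a hypothetical helper name, not a DRBD tool.)
count_failed_by_hour() {
    grep 'Digest integrity check FAILED' \
        | awk '{ split($3, t, ":"); n[t[1]]++ }
               END { for (h in n) print h, n[h] }' \
        | sort
}

# Example with the same rotated logs as above:
# zcat /var/log/messages.4.gz /var/log/messages.3.gz | count_failed_by_hour
```

Each output line is an hour followed by the number of failures logged in that hour, which makes bursts easier to spot than scrolling through the raw list.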
So, right now, here is our configuration:
-----------------------------------------
Server_A =bond0[eth0,eth1]=> Gigabit link => Server_B =bond0[eth0, eth1]
-----------------------------------------
Both servers use the "bonding" module set to "active-backup" (one
interface UP, the other in "hot-standby"). There are NO errors at all with
the "bonding" module: no repeated switching from one NIC to the other, etc ...
All is fine.
Here is my DRBD configuration file (the same on both sides, of course):
----------------------------
global {
#don't send statistics through internet ...
usage-count no;
}
common {
protocol C;
syncer {
#replication speed
rate 20M;
#use compression for bitmap exchange
use-rle;
#set the "on-line device verification" algorithm (should be triggered by a cronjob)
verify-alg sha1;
#set the "checksum-based synchronization" algorithm (used when synchronizing)
csums-alg crc32c;
#tuning the activity log size
#default is 127 ; increase it when using intensive I/O (writing lots of small files)
al-extents 3389;
}
}
resource data-integration {
device /dev/drbd0;
meta-disk internal;
handlers {
#send mail for these events
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh root";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
pri-lost "/usr/lib/drbd/notify-pri-lost.sh root";
local-io-error "/usr/lib/drbd/notify-io-error.sh root";
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
#no notification script for these handlers or don't want to work ...
#before-resync-target "/usr/lib/drbd/";
#after-resync-target "/usr/lib/drbd/";
#initial-split-brain "/usr/lib/drbd/notify-split-brain.sh root";
}
disk {
#use "none" for write-after-write (because, we've got battery for our server)
no-disk-barrier;
no-disk-flushes;
no-md-flushes;
#should NOT be used ? only under special circumstances ? But, we don't need to write in order ... so disable it
no-disk-drain;
#on I/O error, detach disk and use the remote peer disk
on-io-error detach;
}
net {
#authentication
cram-hmac-alg sha1;
shared-secret "<tralalilaleeere>";
#2x primary to use GFS ("rw/ro" or "rw/rw")
#this is not needed if you want to use "ext3/ext4" filesystem ("rw/--" only)
##allow-two-primaries;
#set the "replication traffic integrity checking" algorithm (used when replicating)
data-integrity-alg md5;
#split-brain (node = secondary/primary/both-primary ; discard if no change/disconnect/disconnect)
#do nothing => disconnect
after-sb-0pri discard-zero-changes;
after-sb-1pri disconnect;
after-sb-2pri disconnect;
#tuning recommendations (for RAID controller)
max-buffers 8000;
max-epoch-size 8000;
}
startup {
#dont wait infinitely (cause stuck on boot if not set)
wfc-timeout 15;
degr-wfc-timeout 15;
#when starting DRBD service, set one node to "primary" (<node_name> or "both")
become-primary-on <Server_A>.<domain>;
}
on <Server_A>.<domain> {
address <ip_Server_A>:7788;
disk /dev/sda5;
}
on <Server_B>.<domain> {
address <ip_Server_B>:7788;
disk /dev/sda5;
}
}
----------------------------
Servers A and B are both IBM machines with RAID + battery ... Maybe some of
the values are wrong. The filesystem on DRBD is EXT3 and I don't know
if these values are OK:
-----------
#use "none" for write-after-write (because, we've got battery for our server)
no-disk-barrier;
no-disk-flushes;
no-md-flushes;
#should NOT be used ? only under special circumstances ? But, we don't need to write in order ... so disable it
no-disk-drain;
-----------
I've read several posts and, to "try" to fix the "integrity check" errors, I
have disabled all "offload" features on "bond0", "eth0" and "eth1" on both
servers (ethtool -k <interface> shows everything set to OFF).
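For reference, disabling the offloads can be scripted roughly as below. This is only a sketch under assumptions: the helper name disable_offloads and the DRY_RUN switch are made up for illustration, and the feature list (rx, tx, sg, tso, gso, gro) is a guess at the usual candidates; an older ethtool or driver may reject some of them, which the sketch just reports and skips.

```shell
# Turn off common offload features on each NIC given as an argument.
# Set DRY_RUN=1 to print the ethtool commands instead of running them.
# (disable_offloads and DRY_RUN are hypothetical names for this sketch.)
disable_offloads() {
    for nic in "$@"; do
        for feature in rx tx sg tso gso gro; do
            if [ "${DRY_RUN:-0}" = 1 ]; then
                echo "ethtool -K $nic $feature off"
            else
                # -K (upper case) changes a setting; -k (lower case) only shows them
                ethtool -K "$nic" "$feature" off \
                    || echo "skipped: $nic $feature (not supported?)" >&2
            fi
        done
    done
}

# Example (as root, on both servers):
# disable_offloads eth0 eth1 bond0
```

Running it first with DRY_RUN=1 shows exactly which commands would be issued, and "ethtool -k <interface>" afterwards confirms the result.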
So, what's the problem ???
When starting the Bamboo Remote Agent, I get several integrity check errors
(several per minute). Since I disabled the "offload" on all NICs, the number
of errors has decreased, but they are still there:
--------------------
May 26 12:14:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:16:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:26:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:26:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:28:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:29:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:29:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:33:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:42:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:45:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:45:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:46:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:47:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:50:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:04 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:44 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:52:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:52:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:54:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:59:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:09:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:10:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:11:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:12:04 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:12:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:14:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:16:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:16:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:16:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:17:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:17:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:18:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:18:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:21:44 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:22:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:22:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:26:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:27:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
--------------------
(for now, I can't do better ...)
Sometimes there is a "gap" of several minutes before the next error.
Sometimes not ...
I'm asking myself what can trigger these errors ... so many errors ...
since I have already read many threads ... If I stop the Remote Agent on
Server_B, the problem "disappears" (still there, but less present) ... I will
stop it now (have to eat) and continue this message after that ...
ZZzzZZzzzZZz
Ok, I started the Remote Agent again on Server_B ... waited some minutes and
then ...
-----------------
May 26 13:22:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:26:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:27:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:38:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:40:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
[stop & eat time ...]
[start the Bamboo Remote Agent]
May 26 14:39:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 14:44:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
-----------------
As I have already told you, the number of errors has decreased since I
disabled "offload" ... but why are there so many errors remaining ???
I tried to load both Server_A and Server_B by sending network traffic
(bidirectional) at the maximum bandwidth (40Mb/s .. have reached 10Gb)...
but I never reproduced the same strange behavior as simply starting the
Bamboo Remote Agent ...
I now use Wireshark to sniff packets ... I found the traffic of the remote
agent ... It's minimal and doesn't help me. Every 5 seconds, the Remote
Agent sends a heartbeat from Server_B to Server_A (TCP) using, every time,
the same source port on Server_B to reach the same destination port on
Server_A (:54663):
--------------------
(heartbeat content)
......!Al.....k.$.%....n.$............0+..e...4...0<com.atlassian.bamboo.v2.build.agent.messages.UpdateHeartbeatMessage>
<agentId>32541146</agentId>
<status class="com.atlassian.bamboo.v2.build.agent.AgentIdleStatus"/>
<systemInfo>
<userName>root</userName>
<userTimezone>Europe/Brussels</userTimezone>
<userLocale>English (United States)</userLocale>
<systemEncoding>UTF-8</systemEncoding>
<operatingSystem>Linux 2.6.18-194.11.3.el5</operatingSystem>
<operatingSystemArchitecture>amd64</operatingSystemArchitecture>
<systemDate>Thursday, 26 May 2011</systemDate>
<systemTime>11:04:59</systemTime>
<tempDir>/tmp</tempDir>
<totalMemory>256</totalMemory>
<freeMemory>227</freeMemory>
<usedMemory>28</usedMemory>
<availableProcessors>4</availableProcessors>
<startupTimestamp>1301163182237</startupTimestamp>
<currentDirectory>/mnt/data1/bamboo_agent_home/bin</currentDirectory>
<applicationHome>/mnt/data1/bamboo_agent_home</applicationHome>
<buildWorkingDirectory>/mnt/data1/bamboo_agent_home/xml-data/build-dir</buildWorkingDirectory>
<freeDiskSpace>55 GB</freeDiskSpace>
<currentDate>2011-05-26 11:04:59.474 CEST</currentDate>
<hostName><Server_B>.<domain></hostName>
<ipAddress><ip_Server_B></ipAddress>
</systemInfo>
</com.atlassian.bamboo.v2.build.agent.messages.UpdateHeartbeatMessage>...(......fingerprint...-1542188871550963522....................g......!Al.....l.$.%....z.$............0
--------------------
Don't tell me that this little "heartbeat" can cause so many problems with
DRBD ....
I have one question about the DRBD disconnect/reconnect process when
"integrity check" errors are triggered ... On reconnect, will data (not yet
synced) be correctly "resynchronized" ??? Or should I run an "online-verify"
after every disconnection (which is, of course, impossible for us with 200Gb
and several errors per hour ...) ?
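This doesn't answer the question, but one way to at least observe what DRBD itself thinks is still out of sync is the "oos:" counter in /proc/drbd (out-of-sync amount, in KiB, as far as I understand the 8.3 documentation). The helper below is only a sketch: the name drbd_oos is made up, and it assumes the "oos:<number>" field appears on the statistics line the way 8.3 prints it.

```shell
# Print the "oos:" (out-of-sync, KiB) value for each device found in
# /proc/drbd-style text on stdin. A value of 0 means DRBD has no blocks
# marked out of sync. (drbd_oos is a hypothetical helper name.)
drbd_oos() {
    grep -o 'oos:[0-9]*' | cut -d: -f2
}

# Example:
# drbd_oos < /proc/drbd
#
# An online verify of the resource from the config above would be started with:
# drbdadm verify data-integration
```

If the counter is back to 0 shortly after each reconnect, that would at least suggest the blocks DRBD had marked were resynchronized; it says nothing about blocks DRBD never marked, which is what a full verify would check.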
I will need some good new ideas because now I'm stuck ...
F1 F1 :)
--
View this message in context: http://old.nabble.com/Strange-number-of-%27Digest-integrity-check-FAILED%27-after-starting-one-agent-tp31707977p31707977.html
Sent from the DRBD - User mailing list archive at Nabble.com.