Hello,

We are running RHEL5 with DRBD version 8.3.8 (api:88/proto:86-94) from the CentOS Extras repository (not updated since the first install).

Two servers are used:
- Server_A (primary)
- Server_B (secondary)

Server_B is "inactive" relative to Server_A, which runs the applications (on the "primary" side, with access to DRBD). Because I'm using "Atlassian Bamboo" with a commercial licence (which includes one free remote agent), I want to run the Bamboo Remote Agent on Server_B (it has no need to access DRBD) to get better compilation performance and make better use of the hardware.

All "was" fine before starting the Bamboo Remote Agent ... "was", because I know that the error "Digest integrity check FAILED" already occurred from time to time.

Before the Remote Agent:
--------------------
[root@Server_B ~]$ zcat /var/log/messages.4.gz /var/log/messages.3.gz | grep FAILED
Apr 26 20:00:00 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 04:02:42 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 13:31:04 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 13:51:01 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 14:45:52 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 14:45:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 27 16:07:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 28 07:56:38 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 28 10:02:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 28 14:21:08 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 15:26:55 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 15:53:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 15:53:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
Apr 29 20:00:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:02:12 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:33 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:44 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:03:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 04:04:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 20:24:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 1 20:40:47 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:02:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:02:12 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:08 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:18 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:28 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:03:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:04:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:04:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:06:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:08 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:18 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:33 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:38 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:07:48 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:23 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:28 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:38 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:48 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:08:53 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:09:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:09:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:09:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 04:11:58 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 09:38:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 2 15:48:56 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 04:02:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 04:02:17 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 04:02:32 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 04:02:42 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 07:19:01 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 07:56:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 10:09:36 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 14:45:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 15:40:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 3 21:53:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 04:02:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 04:02:11 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 04:04:06 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 04:04:21 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 11:54:00 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 11:54:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 20:07:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 20:08:51 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 4 20:09:56 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 21:34:55 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 22:00:47 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 22:01:31 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 22:09:46 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 5 22:11:40 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 6 09:49:26 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 6 12:07:13 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 6 22:04:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:02:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:02:27 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:02:32 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:03:47 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:03:52 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:03:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:02 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:12 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:15 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:22 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:37 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:04:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 04:05:27 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 7 22:13:03 Server_B kernel: block drbd0: Digest integrity check FAILED.
--------------------
That is not a huge number of errors for me (but it should be better).

Here is our current configuration:
-----------------------------------------
Server_A =bond0[eth0,eth1]=> Gigabit link => Server_B =bond0[eth0,eth1]
-----------------------------------------
Both servers use the "bonding" module in "active-backup" mode (one interface UP, the other in "hot-standby"). There is NO error at all from the "bonding" module: no repeated switching from one NIC to the other, etc. All is fine.

Here is my DRBD configuration file (the same on both sides, of course):
----------------------------
global {
    # don't send statistics over the internet ...
    usage-count no;
}

common {
    protocol C;

    syncer {
        # replication speed
        rate 20M;
        # use compression for bitmap exchange
        use-rle;
        # set the "on-line device verification" algorithm (should be triggered by a cron job)
        verify-alg sha1;
        # set the "checksum-based synchronization" algorithm (used when synchronizing)
        csums-alg crc32c;
        # tuning the activity log size
        # default is 127; increase it for intensive I/O (writing lots of small files)
        al-extents 3389;
    }
}

resource data-integration {
    device /dev/drbd0;
    meta-disk internal;

    handlers {
        # send mail for these events
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh root";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
        pri-lost "/usr/lib/drbd/notify-pri-lost.sh root";
        local-io-error "/usr/lib/drbd/notify-io-error.sh root";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        # no notification script for these handlers, or they don't want to work ...
        #before-resync-target "/usr/lib/drbd/";
        #after-resync-target "/usr/lib/drbd/";
        #initial-split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }

    disk {
        # use "none" for write-after-write (because our servers have a battery)
        no-disk-barrier;
        no-disk-flushes;
        no-md-flushes;
        # should NOT be used? only under special circumstances? But we don't need to write in order ...
        # so disable it
        no-disk-drain;
        # on I/O error, detach the disk and use the remote peer's disk
        on-io-error detach;
    }

    net {
        # authentication
        cram-hmac-alg sha1;
        shared-secret "<tralalilaleeere>";
        # two primaries, to use GFS ("rw/ro" or "rw/rw")
        # not needed if you want to use an "ext3/ext4" filesystem ("rw/--" only)
        ##allow-two-primaries;
        # set the "replication traffic integrity checking" algorithm (used when replicating)
        data-integrity-alg md5;
        # split-brain (node = secondary/primary/both-primary; discard if no change/disconnect/disconnect)
        # do nothing => disconnect
        after-sb-0pri discard-zero-changes;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        # tuning recommendations (for RAID controller)
        max-buffers 8000;
        max-epoch-size 8000;
    }

    startup {
        # don't wait forever (boot gets stuck if not set)
        wfc-timeout 15;
        degr-wfc-timeout 15;
        # when starting the DRBD service, set one node to "primary" (<node_name> or "both")
        become-primary-on <Server_A>.<domain>;
    }

    on <Server_A>.<domain> {
        address <ip_Server_A>:7788;
        disk /dev/sda5;
    }

    on <Server_B>.<domain> {
        address <ip_Server_B>:7788;
        disk /dev/sda5;
    }
}
----------------------------
Server_A and Server_B are both IBM servers with RAID + battery ... Maybe some of the values are wrong. The filesystem on DRBD is ext3, and I don't know whether these values are OK:
-----------
# use "none" for write-after-write (because our servers have a battery)
no-disk-barrier;
no-disk-flushes;
no-md-flushes;
# should NOT be used? only under special circumstances? But we don't need to write in order ... so disable it
no-disk-drain;
-----------
I've read several posts and, to "try" to fix the "integrity check" errors, I have disabled all "offload" features on "bond0", "eth0" and "eth1" on both servers (ethtool -k <interface> shows everything set to OFF).

So, what's the problem???

When I start the Bamboo Remote Agent, I get several integrity check errors (several per minute).
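
To put a number on "several per minute" and compare periods with and without the agent, the log lines can be bucketed per hour. This is only a minimal sketch, assuming the syslog format quoted in this message; the function name failures_per_hour and the SAMPLE list are made up for illustration, and the sample lines are copied from the logs in this post.

```python
import re
from collections import Counter

# Matches the syslog lines quoted in this message, e.g.
# "May 26 12:14:59 Server_B kernel: block drbd0: Digest integrity check FAILED."
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"\S+\s+kernel: block drbd\d+: Digest integrity check FAILED"
)


def failures_per_hour(lines):
    """Count 'Digest integrity check FAILED' entries per (month, day, hour)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            key = (match.group("month"), int(match.group("day")), int(match.group("hour")))
            counts[key] += 1
    return counts


# Hypothetical sample; in practice, feed the output of the zcat/grep command.
SAMPLE = [
    "May 26 12:14:59 Server_B kernel: block drbd0: Digest integrity check FAILED.",
    "May 26 12:16:24 Server_B kernel: block drbd0: Digest integrity check FAILED.",
    "May 26 12:26:29 Server_B kernel: block drbd0: Digest integrity check FAILED.",
    "May 26 13:09:29 Server_B kernel: block drbd0: Digest integrity check FAILED.",
]

if __name__ == "__main__":
    # Counter keeps insertion order, so buckets print in log order.
    for (month, day, hour), n in failures_per_hour(SAMPLE).items():
        print(f"{month} {day:2d} {hour:02d}h: {n} failure(s)")
```

Run over the rotated logs, this would show at a glance whether the errors cluster at fixed times of day or follow the periods when the agent is running.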
Since I disabled "offload" on all NICs, the number of errors has decreased, but they are still there:
--------------------
May 26 12:14:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:16:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:26:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:26:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:28:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:29:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:29:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:33:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:42:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:45:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:45:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:46:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:47:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:49:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:50:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:04 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:44 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:51:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:52:49 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:52:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:54:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 12:59:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:09:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:10:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:11:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:12:04 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:12:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:14:59 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:16:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:16:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:16:54 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:17:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:17:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:18:29 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:18:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:21:44 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:22:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:22:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:26:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:27:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
--------------------
(for now, I can't do better ...)

Sometimes there is a "gap" of several minutes before the next error. Sometimes not ... I keep asking myself what can trigger these errors ... so many errors ... even though I have already read many threads ... If I stop the Remote Agent on Server_B, the problem "disappears" (still there, but less frequent) ... I will stop it now (have to eat) and continue this message after that ... ZZzzZZzzzZZz

OK, I started the Remote Agent again on Server_B ... waited some minutes and then ...
-----------------
May 26 13:22:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:24:34 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:25:39 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:26:24 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:27:19 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:38:14 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 13:40:09 Server_B kernel: block drbd0: Digest integrity check FAILED.
[stop & eat time ...]
[start the Bamboo Remote Agent]
May 26 14:39:57 Server_B kernel: block drbd0: Digest integrity check FAILED.
May 26 14:44:07 Server_B kernel: block drbd0: Digest integrity check FAILED.
-----------------
As I already said, the number of errors has decreased since I disabled "offload" ... but why are there so many errors remaining???

I tried to load both Server_A and Server_B by sending network traffic (bidirectional) at the maximum bandwidth (40Mb/s .. have reached 10Gb) ... but I never reproduced the strange behavior that simply starting the Bamboo Remote Agent causes ...

I'm now using Wireshark to sniff packets ...
I found the traffic of the remote agent ... It's minimal and doesn't help me. Every 5 seconds, the Remote Agent sends a heartbeat from Server_B to Server_A (TCP), always using the same source port on Server_B to reach the same destination port on Server_A (:54663):
--------------------
(heartbeat content)
......!Al.....k.$.%....n.$............0+..e...4...0<com.atlassian.bamboo.v2.build.agent.messages.UpdateHeartbeatMessage>
  <agentId>32541146</agentId>
  <status class="com.atlassian.bamboo.v2.build.agent.AgentIdleStatus"/>
  <systemInfo>
    <userName>root</userName>
    <userTimezone>Europe/Brussels</userTimezone>
    <userLocale>English (United States)</userLocale>
    <systemEncoding>UTF-8</systemEncoding>
    <operatingSystem>Linux 2.6.18-194.11.3.el5</operatingSystem>
    <operatingSystemArchitecture>amd64</operatingSystemArchitecture>
    <systemDate>Thursday, 26 May 2011</systemDate>
    <systemTime>11:04:59</systemTime>
    <tempDir>/tmp</tempDir>
    <totalMemory>256</totalMemory>
    <freeMemory>227</freeMemory>
    <usedMemory>28</usedMemory>
    <availableProcessors>4</availableProcessors>
    <startupTimestamp>1301163182237</startupTimestamp>
    <currentDirectory>/mnt/data1/bamboo_agent_home/bin</currentDirectory>
    <applicationHome>/mnt/data1/bamboo_agent_home</applicationHome>
    <buildWorkingDirectory>/mnt/data1/bamboo_agent_home/xml-data/build-dir</buildWorkingDirectory>
    <freeDiskSpace>55 GB</freeDiskSpace>
    <currentDate>2011-05-26 11:04:59.474 CEST</currentDate>
    <hostName><Server_B>.<domain></hostName>
    <ipAddress><ip_Server_B></ipAddress>
  </systemInfo>
</com.atlassian.bamboo.v2.build.agent.messages.UpdateHeartbeatMessage>...(......fingerprint...-1542188871550963522....................g......!Al.....l.$.%....z.$............0
--------------------
Don't tell me that this little "heartbeat" can cause so many problems with DRBD ...

I have one question about the DRBD disconnect/reconnect process when "integrity check" errors are triggered ... On reconnect, will the data (not yet synced) be correctly "resynchronized"???
Or should I run an "online verify" after every disconnection (which is, of course, impossible for us with 200 GB and several errors per hour ...)?

I need some fresh ideas because right now I'm stuck ... F1 F1 :)

--
View this message in context: http://old.nabble.com/Strange-number-of-%27Digest-integrity-check-FAILED%27-after-starting-one-agent-tp31707977p31707977.html
Sent from the DRBD - User mailing list archive at Nabble.com.