On Wed, Mar 09, 2011 at 10:49:08AM -0500, William Seligman wrote:
> On 3/4/11 4:58 PM, William Seligman wrote:
> > On 3/4/11 12:38 PM, William Seligman wrote:
> >> I've RTFM'ed and googled this problem. Now I ask the experts.
> >
> > Now that I've joined this list, I looked at the archives directly.
> > I see that Cory Coager reported the same problem:
> >
> > http://lists.linbit.com/pipermail/drbd-user/2011-March/015735.html

No. He reported an *entirely* different problem. He gets integrity
digest mismatches, "Digest integrity check FAILED:". You get DRBD
ping-ack timeouts when idle, "PingAck did not arrive in time."

> > Lars Ellenberg suggested that the problem was due to a bad NIC.
> > Maybe... but what are the odds that two different systems have a
> > bad NIC?
>
> I just did a little archaeology. I haven't always experienced these
> regular DRBD sync error messages. They began when I made two changes
> to my configuration file:
>
> - I switched from "Protocol C" to "Protocol A".
> - I added "net { ping-timeout 100; }"
>
> Are either of these changes likely to cause problems?
>
> My next step would normally be to reverse those changes, but these
> are production systems and it's hard for me to perform tests.

D'oh. :-)

You set ping-timeout to 10 seconds (why would you do that? that does
not make sense...), which happens to be the default for
ping-int[erval]. There is a code branch in the module code that relies
on these two timeouts being different. We will fix that.

Meanwhile, just reduce ping-timeout to something sane, or set ping-int
to 11 [actually anything != (ping-timeout/10) should do], and do a

  drbdadm disconnect all; drbdadm connect all

That should do the trick, and can safely be done during production in
your case, as you have frequent disconnects anyway.

> >> Every ten seconds, the error messages at the end of this post
> >> appear in the log of the primary system, and there are similar
> >> lines on the secondary system. It seems that drbd2 is losing its
> >> connection, re-establishing it, and doing a re-sync.
> >>
> >> Everything works, most of the time. But once every few weeks
> >> there's enough of a delay that Corosync takes notice and STONITHs
> >> one of the systems, which is a big pain.

Uh? Why would Corosync notice anything here, or even care about the
DRBD connection status?

Pacemaker may. Well, not really, but the ocf:linbit:drbd resource
agent, if called for a monitor operation during such a period, will
notice. But that is unlikely, and certainly would not cause a STONITH;
at most some log messages.

Whatever causes your STONITH events, this is not it. You are jumping
to conclusions here.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
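[Editor's note] The fix described in the message above can be sketched as a config fragment. The resource name `r0` and the ping-timeout value are illustrative assumptions, not from the original thread; ping-timeout is specified in units of 0.1 s, while ping-int is in whole seconds, which is why "100" collided with the default ping-int of 10 s:

```
resource r0 {
  net {
    # ping-timeout is in tenths of a second: 100 meant 10 s,
    # which equals the default ping-int of 10 s and triggers the bug.
    ping-timeout 5;    # 0.5 s (illustrative value; anything sane works)
    ping-int     11;   # or leave ping-timeout alone and make ping-int != ping-timeout/10
  }
}
```

After editing the configuration, the reconnect suggested in the message (`drbdadm disconnect all; drbdadm connect all`) makes the new timeouts take effect.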