On Wed, Mar 09, 2011 at 10:49:08AM -0500, William Seligman wrote:
> On 3/4/11 4:58 PM, William Seligman wrote:
> > On 3/4/11 12:38 PM, William Seligman wrote:
> >> I've RTFM'ed and googled this problem. Now I ask the experts.
> >
> > Now that I've joined this list, I looked at the archives directly.
> > I see that Cory Coager reported the same problem:
> >
> > http://lists.linbit.com/pipermail/drbd-user/2011-March/015735.html

No. He reported an *entirely* different problem. He gets integrity
digest mismatches, "Digest integrity check FAILED:". You get DRBD
ping-ack timeouts when idle, "PingAck did not arrive in time."

> > Lars Ellenberg suggested that the problem was due to a bad NIC.
> > Maybe... but what are the odds that two different systems have a
> > bad NIC?
>
> I just did a little archaeology. I haven't always experienced these
> regular DRBD sync error messages. They began when I made two changes
> to my configuration file:
>
> - I switched from "Protocol C" to "Protocol A".
> - I added "net { ping-timeout 100; }"
>
> Are either of these changes likely to cause problems?
>
> My next step would normally be to reverse those changes, but these
> are production systems and it's hard for me to perform tests.

D'oh. :-)

You set ping-timeout to 10 seconds (why would you do that? that does
not make sense...), which happens to be the default for
ping-int[erval]. There is a code branch in the module code that relies
on these two timeouts being different. We will fix that.

Meanwhile, just reduce ping-timeout to something sane, or set ping-int
to 11 [actually anything != (ping-timeout/10) should do], and do a

  drbdadm disconnect all; drbdadm connect all

That should do the trick, and can safely be done during production in
your case, as you have frequent disconnects anyway.

> >> Every ten seconds, the error messages at the end of this post
> >> appear in the log of the primary system, and there are similar
> >> lines on the secondary system. It seems that drbd2 is losing its
> >> connection, re-establishing it, and doing a re-sync.
> >>
> >> Everything works, most of the time. But once every few weeks
> >> there's enough of a delay that Corosync takes notice and STONITHs
> >> one of the systems, which is a big pain.

Uh? Why would Corosync notice anything here, or even care about the
DRBD connection status?

Pacemaker may. Well, not really, but the ocf:linbit:drbd resource
agent, if called for a monitor operation during such a period, will
notice. But that is unlikely, and certainly would not cause a STONITH;
at most some log messages.

Whatever causes your STONITH events, this is not it. You are jumping
to conclusions here.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
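[Editor's note] The fix described in the message above can be sketched as a config fragment. The resource name `r0` and the ping-timeout value are illustrative assumptions, not from the original thread; ping-timeout is specified in units of 0.1 s, while ping-int is in whole seconds, which is why "100" collided with the default ping-int of 10 s:

```
resource r0 {
  net {
    # ping-timeout is in tenths of a second: 100 meant 10 s,
    # which equals the default ping-int of 10 s and triggers the bug.
    ping-timeout 5;    # 0.5 s (illustrative value; anything sane works)
    ping-int     11;   # or leave ping-timeout alone and make ping-int != ping-timeout/10
  }
}
```

After editing the configuration, the reconnect suggested in the message (`drbdadm disconnect all; drbdadm connect all`) makes the new timeouts take effect.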