[DRBD-user] DRBD constantly re-syncing, getting to 100%, starting over. What?

Fri Oct 14 09:07:55 CEST 2016

> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-
> bounces at lists.linbit.com] On Behalf Of Lars Ellenberg
> Sent: Wednesday, October 12, 2016 11:49 PM
> To: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] DRBD constantly re-syncing, getting to 100%,
> starting over. What?
> 
> On Wed, Oct 12, 2016 at 04:35:58PM +0200, Jan Schermer wrote:
> > Shot in the dark - are the drives (or their controller if you're
> > using raid) using any form of caching? It is conceivable that when
> > resync is finished it tries flushing the data to the device, and if
> > this takes waaaaay too long it could lead to a timeout of the drbd
> > kernel thread.
> >
> > Is IO happening on those drives when they are resyncing?
> > Try running something like "sync ; sleep 1 ; sync" on the Inconsistent
> > node when it's resyncing (I hope that won't kill your IO)
> 
> sync only affects stuff in the linux (buffer/) page cache, DRBD sits below that.
> "no effect" on DRBD IO.
> 
> > > Oct 12 06:56:11 ha14a kernel: block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
> > > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: PingAck did not arrive in time.
> > > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> 
> has been said before:
> DRBD ping timeout is apparently too short for the latency in your setup.
> increase it appropriately.
> 
> Where latency in this case involves network rtt plus kernel thread scheduling
> plus maybe additional synchronous (flush/fua) IO plus whatever else DRBD
> feels is necessary for a full DRBD to DRBD round-trip.
> 
> > > However, I can guarantee that the network connection is solid.
> > > Running ping flood, I get 30,000 packets sent with no loss or
> > > latency.
> 
> Mind telling us the network characteristics?  IO backend?
> Virtualized?  Distribution? Kernel and DRBD version(s)?
> 

We have a dozen other DRBD clusters, all on the same switched network, and this has never happened on any of them over the past decade or so. The nodes are in different data centers 22 miles apart, connected by gigabit fiber. Latency is always sub-millisecond. See the following ping test:

[root@ha14a ~]# ping -f ha14b-cl
PING ha14b-cl.mycharts.md (198.51.100.43) 56(84) bytes of data.
.^C
--- ha14b-cl.mycharts.md ping statistics ---
23433 packets transmitted, 23432 received, 0% packet loss, time 15911ms
rtt min/avg/max/mdev = 0.585/0.659/0.847/0.021 ms, ipg/ewma 0.679/0.658 ms
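
To put those numbers next to DRBD's ping timeout, here is a minimal Python sketch. It assumes the documented DRBD 8.4 default of ping-timeout 5, which is in tenths of a second (500 ms); the configured value on these nodes may differ. The function names are made up for illustration. It only measures ICMP round-trip time, so it says nothing about the kernel thread scheduling or flush/FUA IO that also count toward a DRBD-to-DRBD round trip.

```python
import re

# Summary line copied from the ping run above.
PING_SUMMARY = (
    "rtt min/avg/max/mdev = 0.585/0.659/0.847/0.021 ms, "
    "ipg/ewma 0.679/0.658 ms"
)

# Assumption: DRBD 8.4 default ping-timeout, expressed in tenths of a second.
DEFAULT_PING_TIMEOUT_TENTHS = 5


def parse_rtt_ms(summary):
    """Return (min, avg, max, mdev) in milliseconds from a ping summary line."""
    match = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", summary)
    if match is None:
        raise ValueError("no rtt statistics found in summary line")
    return tuple(float(group) for group in match.groups())


def ping_timeout_ms(tenths):
    """Convert a DRBD ping-timeout value (tenths of a second) to milliseconds."""
    return tenths * 100.0


def headroom_ms(summary, tenths=DEFAULT_PING_TIMEOUT_TENTHS):
    """Timeout budget left after subtracting the worst observed network RTT."""
    _, _, rtt_max, _ = parse_rtt_ms(summary)
    return ping_timeout_ms(tenths) - rtt_max


if __name__ == "__main__":
    rtt_min, rtt_avg, rtt_max, rtt_mdev = parse_rtt_ms(PING_SUMMARY)
    print("worst network rtt: %.3f ms" % rtt_max)
    print("ping-timeout budget: %.0f ms" % ping_timeout_ms(DEFAULT_PING_TIMEOUT_TENTHS))
    print("left for non-network delays: %.3f ms" % headroom_ms(PING_SUMMARY))
```

If that assumed default applies, the network uses well under 1 ms of the budget, so a missed PingAck would have to come from the non-network parts of the round trip.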

The servers are all physical, running RHEL 6.3 with kernel 2.6.32-279.el6.x86_64, on SSD drives.

DRBD version is 8.4.3

> --
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
> 
> DRBD(r) and LINBIT(r) are registered trademarks of LINBIT __ please don't Cc
> me, but send to list -- I'm subscribed