[DRBD-user] DRBD ping-timeout values

George H george.dma at gmail.com
Fri Apr 4 16:50:12 CEST 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On 4/4/08, Florian Haas <florian.haas at linbit.com> wrote:
> On Friday 04 April 2008 15:18:21 George H wrote:
>  > OK, I upgraded both my blades to the latest stable kernel, 2.6.24,
>  > rebuilt drbd 8.0.8, and restarted the sync.
>  >
>  > I noticed offhand that the connection reached the sync state quicker than before.
>  >
>  > Apr  4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
>  > Apr  4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
>  >
>  > Normally this used to take 5 or more minutes.
>  >
>  > But just as that was quick, so was the "network failure"; see below:
>  >
>  > Apr  4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
>  > Apr  4 15:49:35 mailserv1 drbd0: Becoming sync source due to disk states.
>  > Apr  4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
>  > Apr  4 15:49:35 mailserv1 drbd0: writing of bitmap took 7 jiffies
>  > Apr  4 15:49:35 mailserv1 drbd0: 476 GB (124997941 bits) marked
>  > out-of-sync by on disk bit-map.
>  > Apr  4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
>  > Apr  4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
>  > Apr  4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
>  > Apr  4 15:50:18 mailserv1 drbd0: Began resync as SyncSource (will sync
>  > 499991764 KB [124997941 bits set]).
>  > Apr  4 15:50:18 mailserv1 drbd0: Writing meta data super block now.
>  > Apr  4 16:03:26 mailserv1 drbd0: PingAck did not arrive in time.
>  > Apr  4 16:03:26 mailserv1 drbd0: peer( Secondary -> Unknown ) conn(
>  > SyncSource -> NetworkFailure )
>  > Apr  4 16:03:26 mailserv1 drbd0: asender terminated
>  > Apr  4 16:03:26 mailserv1 drbd0: drbd_pp_alloc interrupted!
>  > Apr  4 16:03:26 mailserv1 drbd0: alloc_ee: Allocation of a page failed
>  > Apr  4 16:03:26 mailserv1 drbd0: error receiving RSDataRequest, l: 24!
>  > Apr  4 16:03:26 mailserv1 drbd0: tl_clear()
>  > Apr  4 16:03:26 mailserv1 drbd0: Connection closed
>  > Apr  4 16:03:26 mailserv1 drbd0: Writing meta data super block now.
>  > Apr  4 16:03:26 mailserv1 drbd0: conn( NetworkFailure -> Unconnected )
>  > Apr  4 16:03:26 mailserv1 drbd0: receiver terminated
>  > Apr  4 16:03:26 mailserv1 drbd0: receiver (re)started
>  > Apr  4 16:03:26 mailserv1 drbd0: conn( Unconnected -> WFConnection )
>  > Apr  4 16:03:26 mailserv1 drbd0: Handshake successful: DRBD Network
>  > Protocol version 86
>
>
> OK so that's a very quick disconnection and subsequent reconnection. How often
>  does that occur? Do you ever get network interruptions for longer periods?
>  When you do, what does "tcpdump -i <your replication interface>" say?

This disconnection happens often; right now it happens every 10-20
minutes. Otherwise we don't see any network interruptions at all. On
Monday I'm going to connect the two blade chassis directly with a
crossover link, bypassing the switch entirely, to see whether the
switch is the problem.
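
In the meantime I'll check the error counters on the replication NICs
on both nodes; a bad switch port or a speed/duplex mismatch should show
up there first. Roughly this (eth1 is just a placeholder for whichever
interface carries the replication traffic here):

    # RX/TX errors, drops and overruns on the replication interface
    # (eth1 is a placeholder -- substitute the real interface)
    ip -s link show eth1

    # Driver-level counters (CRC errors etc.) and negotiated
    # speed/duplex; a duplex mismatch against the switch is a classic
    # cause of stalls that only show up under sustained load, like a
    # resync
    ethtool -S eth1
    ethtool eth1

If those counters climb while the resync runs, the switch or cabling is
a likely suspect even without visible interruptions.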

I captured a tcpdump log of the entire sync session up to the failure.
It's huge, and I don't know what I'm supposed to look for in it. I
grepped out the time slot where the PingAck failure occurred, but it
all looks alien to me.

What am I looking for in the tcpdump logs?
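
For reference, this is roughly how I'm capturing (port 7788 and eth1
are placeholders; the real values come from my drbd.conf and routing):

    # Capture only the DRBD replication traffic, full packets, to a
    # file that can be inspected in wireshark later.
    # Port 7788 and eth1 are assumptions -- take both from drbd.conf.
    tcpdump -i eth1 -nn -s 0 -w drbd-resync.pcap 'port 7788'

My guesses for what matters: bursts of TCP retransmissions and
duplicate ACKs (a lossy link), zero-window advertisements (one side not
draining its socket buffers), or simply a silent gap with no PingAck
reply just before the NetworkFailure transition. But I'd appreciate
confirmation from someone who has debugged this before.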

Thanks
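
P.S. For completeness, these are the knobs I've been experimenting with
in the net section of drbd.conf. The values shown are only an example,
not a recommendation (and per Florian's point below, raising
ping-timeout would only paper over whatever is stalling the link);
check the drbd.conf man page for the exact units on your version:

    resource r0 {             # resource name is a placeholder
      net {
        ping-int     10;      # seconds between keep-alive pings
        ping-timeout  5;      # tenths of a second to wait for the
                              # PingAck (5 = 0.5 s) before declaring
                              # "PingAck did not arrive in time"
      }
    }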

>  I strongly suspect at this point all your DRBD tuning efforts, while
>  admirable, are futile. You really need to fix your network stack first.
>
>
>  Cheers,
>  Florian
>
>  --
>  : Florian G. Haas
>  : LINBIT Information Technologies GmbH
>  : Vivenotgasse 48, A-1120 Vienna, Austria
>
>  When replying, there is no need to CC my personal address.
>  I monitor the list on a daily basis. Thank you.


