[DRBD-user] DRBD ping-timeout values

George H george.dma at gmail.com
Fri Apr 4 17:40:49 CEST 2008


On 4/4/08, Florian Haas <florian.haas at linbit.com> wrote:
> On Friday 04 April 2008 16:50:12 George H wrote:
>  > On 4/4/08, Florian Haas <florian.haas at linbit.com> wrote:
>  > > On Friday 04 April 2008 15:18:21 George H wrote:
>  > >  > OK I upgraded both my blades to the latest stable kernel 2.6.24.
>  > >  > Rebuilt drbd 8.0.8 and restarted the sync.
>  > >  >
>  > >  > I noticed off hand the connection to get sync was quicker than before
>  > >  >
>  > >  > Apr  4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
>  > >  > Apr  4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
>  > >  >
>  > >  > normally it used to take 5 or more minutes.
>  > >  >
>  > >  > But as that was quick.. so was the "network failure" see below
>  > >  >
>  > >  > Apr  4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
>  > >  > Apr  4 15:49:35 mailserv1 drbd0: Becoming sync source due to disk states.
>  > >  > Apr  4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
>  > >  > Apr  4 15:49:35 mailserv1 drbd0: writing of bitmap took 7 jiffies
>  > >  > Apr  4 15:49:35 mailserv1 drbd0: 476 GB (124997941 bits) marked
>  > >  > out-of-sync by on disk bit-map.
>  > >  > Apr  4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
>  > >  > Apr  4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
>  > >  > Apr  4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
>  > >  > Apr  4 15:50:18 mailserv1 drbd0: Began resync as SyncSource (will sync
>  > >  > 499991764 KB [124997941 bits set]).
>  > >  > Apr  4 15:50:18 mailserv1 drbd0: Writing meta data super block now.
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: PingAck did not arrive in time.
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: peer( Secondary -> Unknown ) conn(
>  > >  > SyncSource -> NetworkFailure )
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: asender terminated
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: drbd_pp_alloc interrupted!
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: alloc_ee: Allocation of a page failed
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: error receiving RSDataRequest, l: 24!
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: tl_clear()
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: Connection closed
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: Writing meta data super block now.
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: conn( NetworkFailure -> Unconnected )
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: receiver terminated
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: receiver (re)started
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: conn( Unconnected -> WFConnection )
>  > >  > Apr  4 16:03:26 mailserv1 drbd0: Handshake successful: DRBD Network
>  > >  > Protocol version 86
>  > >
>  > > OK so that's a very quick disconnection and subsequent reconnection. How
>  > > often does that occur? Do you ever get network interruptions for longer
>  > > periods? When you do, what does "tcpdump -i <your replication interface>"
>  > > say?
>  >
>  > This disconnection happens often. Right now it happens every 10-20
>  > minutes. We don't get network interruptions at all.
>
>
> Oh yes, on your replication link you certainly do. At least your log excerpt
>  says so.
>
>
>  > On Monday I'm
>  > going to try to connect the two blade chassis via a cross over link
>  > completely excluding the switch just to see if the switch is the
>  > problem.
>
>
> Good call.
>
>
>  > I got the tcpdump log of the entire sync session up to the failure.
>
>
> That is probably of little use. It would only be interesting to see what
>  happens while DRBD is forcibly disconnected (unanswered ARP requests, etc.).
>
>
>  > It's huge and I don't know what I'm supposed to look for in it. I
>  > 'grep'ed out the timeslot where the PingAck timeout occurred. It all
>  > looks alien to me.
>
>
> No offense please, but I suggest you bring someone in to whom a packet trace
>  doesn't look alien. If it does to you, you're going to have a very hard time
>  troubleshooting your network stack.

Well, I've got some good news. I read some tutorials on tcpdump and
Wireshark, wrote the dump out in a format Wireshark can read (so I can
view it graphically) and... wow. Now I know what the problem is, and
you're right that it has to do with the network stack.

This log file is loaded with TCP Dup ACK, Out-Of-Order, and
Retransmission packets. There is definitely something wrong, most
likely with our Alcatel switch. I'm no expert on networking, but this
may have something to do with the MTU size, possibly with
fragmentation of packets. Well, at least now I can properly present my
case to our networking people and hope they learn the error of their
ways :P
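
To line the capture up with the DRBD side, it can help to know how long
each resync ran before the PingAck timeout hit. Below is a minimal Python
sketch of that, assuming syslog lines in the same format as the excerpt
quoted above. The names parse_events and sync_durations_before_timeout,
and the default year, are made up for illustration and are not part of
DRBD or any other tool.

```python
import re
from datetime import datetime

# Sample lines copied from the log excerpt quoted earlier in this thread.
SAMPLE_LOG = """\
Apr  4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
Apr  4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
Apr  4 16:03:26 mailserv1 drbd0: PingAck did not arrive in time.
Apr  4 16:03:26 mailserv1 drbd0: conn( NetworkFailure -> Unconnected )
Apr  4 16:03:26 mailserv1 drbd0: conn( Unconnected -> WFConnection )
"""

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# "Apr  4 15:49:35 <host> drbd<N>: <message>"
LINE_RE = re.compile(
    r"^(\w{3})\s+(\d+)\s+(\d\d):(\d\d):(\d\d)\s+\S+\s+drbd\d+:\s+(.*)$")
# "conn( OldState -> NewState )"
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def parse_events(text, year=2008):
    """Return (timestamp, message) pairs for every DRBD syslog line.

    Syslog timestamps carry no year, so the year is an assumption
    supplied by the caller.
    """
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match or match.group(1) not in MONTHS:
            continue  # not a DRBD line in the expected format
        month = MONTHS[match.group(1)]
        day, hour, minute, second = (int(match.group(i)) for i in range(2, 6))
        stamp = datetime(year, month, day, hour, minute, second)
        events.append((stamp, match.group(6)))
    return events


def sync_durations_before_timeout(events):
    """Seconds between entering SyncSource and each PingAck timeout."""
    durations = []
    sync_start = None
    for stamp, message in events:
        conn = CONN_RE.search(message)
        if conn and conn.group(2) == "SyncSource":
            sync_start = stamp
        elif "PingAck did not arrive in time" in message and sync_start:
            durations.append((stamp - sync_start).total_seconds())
            sync_start = None
    return durations


if __name__ == "__main__":
    print(sync_durations_before_timeout(parse_events(SAMPLE_LOG)))
```

For the excerpt above this gives 788 seconds (15:50:18 to 16:03:26), so
the section of the capture worth inspecting is the last few seconds
before 16:03:26.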

Thanks, Florian and Lars, for helping me troubleshoot this.


>  Cheers,
>  Florian
>
>  --
>  : Florian G. Haas
>  : LINBIT Information Technologies GmbH
>  : Vivenotgasse 48, A-1120 Vienna, Austria
>
>  When replying, there is no need to CC my personal address.
>  I monitor the list on a daily basis. Thank you.