On Friday 04 April 2008 16:50:12 George H wrote:
> On 4/4/08, Florian Haas <florian.haas at linbit.com> wrote:
> > On Friday 04 April 2008 15:18:21 George H wrote:
> > > OK I upgraded both my blades to the latest stable kernel 2.6.24.
> > > Rebuilt drbd 8.0.8 and restarted the sync.
> > >
> > > I noticed off hand the connection to get sync was quicker than before
> > >
> > > Apr 4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
> > > Apr 4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
> > >
> > > normally it used to take 5 or more minutes.
> > >
> > > But as that was quick.. so was the "network failure" see below
> > >
> > > Apr 4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
> > > Apr 4 15:49:35 mailserv1 drbd0: Becoming sync source due to disk states.
> > > Apr 4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
> > > Apr 4 15:49:35 mailserv1 drbd0: writing of bitmap took 7 jiffies
> > > Apr 4 15:49:35 mailserv1 drbd0: 476 GB (124997941 bits) marked out-of-sync by on disk bit-map.
> > > Apr 4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
> > > Apr 4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
> > > Apr 4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
> > > Apr 4 15:50:18 mailserv1 drbd0: Began resync as SyncSource (will sync 499991764 KB [124997941 bits set]).
> > > Apr 4 15:50:18 mailserv1 drbd0: Writing meta data super block now.
> > > Apr 4 16:03:26 mailserv1 drbd0: PingAck did not arrive in time.
> > > Apr 4 16:03:26 mailserv1 drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
> > > Apr 4 16:03:26 mailserv1 drbd0: asender terminated
> > > Apr 4 16:03:26 mailserv1 drbd0: drbd_pp_alloc interrupted!
> > > Apr 4 16:03:26 mailserv1 drbd0: alloc_ee: Allocation of a page failed
> > > Apr 4 16:03:26 mailserv1 drbd0: error receiving RSDataRequest, l: 24!
> > > Apr 4 16:03:26 mailserv1 drbd0: tl_clear()
> > > Apr 4 16:03:26 mailserv1 drbd0: Connection closed
> > > Apr 4 16:03:26 mailserv1 drbd0: Writing meta data super block now.
> > > Apr 4 16:03:26 mailserv1 drbd0: conn( NetworkFailure -> Unconnected )
> > > Apr 4 16:03:26 mailserv1 drbd0: receiver terminated
> > > Apr 4 16:03:26 mailserv1 drbd0: receiver (re)started
> > > Apr 4 16:03:26 mailserv1 drbd0: conn( Unconnected -> WFConnection )
> > > Apr 4 16:03:26 mailserv1 drbd0: Handshake successful: DRBD Network Protocol version 86
> >
> > OK so that's a very quick disconnection and subsequent reconnection. How
> > often does that occur? Do you ever get network interruptions for longer
> > periods? When you do, what does "tcpdump -i <your replication interface>"
> > say?
>
> This disconnection happens often. Right now it happens every 10-20
> minutes. We don't get network interruptions at all.

Oh yes, on your replication link you certainly do. At least your log excerpt
says so.

> On monday i'm
> going to try to connect the two blade chassis via a cross over link
> completely excluding the switch just to see if the switch is the
> problem.

Good call.

> I got the tcpdump log of the entire sync session up to the failure.

That is probably of little use. It would only be interesting to see what
happens while DRBD is forcefully disconnected (unanswered ARP requests, etc.).

> It's huge and I don't know what I'm supposed to look for in it. I
> 'grep'ed out the timeslot where the pingAck occured. it all looks
> alien to me.

No offense please, but I suggest you bring someone in to whom a packet trace
doesn't look alien. If it does to you, you're going to have a very hard time
troubleshooting your network stack.

Cheers,
Florian

--
: Florian G. Haas
: LINBIT Information Technologies GmbH
: Vivenotgasse 48, A-1120 Vienna, Austria

When replying, there is no need to CC my personal address.
I monitor the list on a daily basis. Thank you.
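
Since only the moment of the forced disconnect is of interest in a packet trace, it may help to pull the exact failure timestamps out of syslog first and then look at the capture around those times. The following is a minimal Python sketch of that idea, not anything shipped with DRBD. The function names (`parse`, `resync_lifetimes`) are hypothetical, the year is an assumption because syslog lines do not carry one, and the sample lines are copied from the log excerpt quoted above. Month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Syslog prefix as seen in the excerpt: "Apr 4 16:03:26 mailserv1 drbd0: message"
LINE_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+\S+\s+(drbd\d+):\s+(.*)$"
)

# Lines taken from the log excerpt in this thread.
SAMPLE = [
    "Apr 4 15:50:18 mailserv1 drbd0: Began resync as SyncSource (will sync 499991764 KB [124997941 bits set]).",
    "Apr 4 15:50:18 mailserv1 drbd0: Writing meta data super block now.",
    "Apr 4 16:03:26 mailserv1 drbd0: PingAck did not arrive in time.",
    "Apr 4 16:03:26 mailserv1 drbd0: conn( NetworkFailure -> Unconnected )",
]


def parse(line, year=2008):
    """Split one syslog line into (timestamp, device, message), or None.

    The year is an assumption: syslog timestamps do not include it.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    mon, day, clock, dev, msg = m.groups()
    ts = datetime.strptime(f"{year} {mon} {day} {clock}", "%Y %b %d %H:%M:%S")
    return ts, dev, msg


def resync_lifetimes(lines):
    """List (device, resync start, PingAck failure, seconds between them)."""
    started = {}   # device -> time of the most recent "Began resync"
    results = []
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        ts, dev, msg = parsed
        if msg.startswith("Began resync"):
            started[dev] = ts
        elif msg.startswith("PingAck did not arrive in time") and dev in started:
            start = started.pop(dev)
            results.append((dev, start, ts, int((ts - start).total_seconds())))
    return results


if __name__ == "__main__":
    for dev, start, failed, secs in resync_lifetimes(SAMPLE):
        print(f"{dev}: resync began {start:%H:%M:%S}, PingAck timeout {failed:%H:%M:%S}, {secs} s later")
```

For the excerpt above this reports a resync that began at 15:50:18 and was cut off at 16:03:26, which fits the reported "every 10-20 minutes". The failure timestamp is the point in the tcpdump capture worth inspecting; the rest of the trace is unlikely to add much.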