On 4/4/08, Florian Haas <florian.haas at linbit.com> wrote:
> On Friday 04 April 2008 15:18:21 George H wrote:
> > OK I upgraded both my blades to the latest stable kernel 2.6.24.
> > Rebuilt drbd 8.0.8 and restarted the sync.
> >
> > I noticed off hand the connection to get sync was quicker than before
> >
> > Apr 4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
> > Apr 4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
> >
> > normally it used to take 5 or more minutes.
> >
> > But as that was quick.. so was the "network failure" see below
> >
> > Apr 4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
> > Apr 4 15:49:35 mailserv1 drbd0: Becoming sync source due to disk states.
> > Apr 4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
> > Apr 4 15:49:35 mailserv1 drbd0: writing of bitmap took 7 jiffies
> > Apr 4 15:49:35 mailserv1 drbd0: 476 GB (124997941 bits) marked
> > out-of-sync by on disk bit-map.
> > Apr 4 15:49:35 mailserv1 drbd0: Writing meta data super block now.
> > Apr 4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
> > Apr 4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
> > Apr 4 15:50:18 mailserv1 drbd0: Began resync as SyncSource (will sync
> > 499991764 KB [124997941 bits set]).
> > Apr 4 15:50:18 mailserv1 drbd0: Writing meta data super block now.
> > Apr 4 16:03:26 mailserv1 drbd0: PingAck did not arrive in time.
> > Apr 4 16:03:26 mailserv1 drbd0: peer( Secondary -> Unknown ) conn(
> > SyncSource -> NetworkFailure )
> > Apr 4 16:03:26 mailserv1 drbd0: asender terminated
> > Apr 4 16:03:26 mailserv1 drbd0: drbd_pp_alloc interrupted!
> > Apr 4 16:03:26 mailserv1 drbd0: alloc_ee: Allocation of a page failed
> > Apr 4 16:03:26 mailserv1 drbd0: error receiving RSDataRequest, l: 24!
> > Apr 4 16:03:26 mailserv1 drbd0: tl_clear()
> > Apr 4 16:03:26 mailserv1 drbd0: Connection closed
> > Apr 4 16:03:26 mailserv1 drbd0: Writing meta data super block now.
> > Apr 4 16:03:26 mailserv1 drbd0: conn( NetworkFailure -> Unconnected )
> > Apr 4 16:03:26 mailserv1 drbd0: receiver terminated
> > Apr 4 16:03:26 mailserv1 drbd0: receiver (re)started
> > Apr 4 16:03:26 mailserv1 drbd0: conn( Unconnected -> WFConnection )
> > Apr 4 16:03:26 mailserv1 drbd0: Handshake successful: DRBD Network
> > Protocol version 86
> >
> OK so that's a very quick disconnection and subsequent reconnection. How often
> does that occur? Do you ever get network interruptions for longer periods?
> When you do, what does "tcpdump -i <your replication interface>" say?

This disconnection happens often. Right now it happens every 10-20
minutes. We don't get network interruptions at all. On Monday I'm going
to connect the two blade chassis via a crossover link, bypassing the
switch completely, to see whether the switch is the problem.

I captured a tcpdump log of the entire sync session up to the failure.
It's huge and I don't know what I'm supposed to look for in it. I
grepped out the time slot where the PingAck timeout occurred, but it
all looks alien to me. What am I looking for in the tcpdump logs?

Thanks

> I strongly suspect at this point all your DRBD tuning efforts, while
> admirable, are futile. You really need to fix your network stack first.
>
>
> Cheers,
> Florian
>
> --
> : Florian G. Haas
> : LINBIT Information Technologies GmbH
> : Vivenotgasse 48, A-1120 Vienna, Austria
>
> When replying, there is no need to CC my personal address.
> I monitor the list on a daily basis. Thank you.
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
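
One way to put numbers on "every 10-20 minutes" is to pull the conn( ... )
state changes out of syslog and measure how long each resync ran before it
dropped to NetworkFailure. The resulting timestamps can also help narrow
down which slice of a large tcpdump capture to read. The following is only
a rough Python sketch: the function names (parse_transitions,
resync_uptimes) are made up for illustration, the year is assumed because
syslog timestamps omit it, and the regular expressions are written against
the log lines quoted above, with wrapped lines rejoined the way they would
appear in the actual log file.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
#   Apr 4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"(?P<dev>drbd\d+):\s+(?P<msg>.*)$"
)
# Matches the connection-state part of a message: conn( Old -> New )
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def parse_transitions(lines, year=2008):
    """Yield (timestamp, device, old_state, new_state) per conn() change.

    The year is an assumption supplied by the caller; syslog omits it.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        c = CONN_RE.search(m.group("msg"))
        if not c:
            continue
        # Collapse the padding syslog may put in front of one-digit days.
        stamp = " ".join(m.group("ts").split())
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        yield ts, m.group("dev"), c.group(1), c.group(2)


def resync_uptimes(transitions):
    """Return seconds each resync ran before dropping to NetworkFailure."""
    started = {}  # device -> time the resync began
    uptimes = []
    for ts, dev, _old, new in transitions:
        if new in ("SyncSource", "SyncTarget"):
            started[dev] = ts
        elif new == "NetworkFailure" and dev in started:
            uptimes.append((ts - started.pop(dev)).total_seconds())
    return uptimes


# Lines taken from the log quoted in this thread (wrapped lines rejoined).
SAMPLE = """\
Apr 4 15:49:35 mailserv1 drbd0: conn( Connected -> WFBitMapS )
Apr 4 15:50:18 mailserv1 drbd0: conn( WFBitMapS -> SyncSource )
Apr 4 16:03:26 mailserv1 drbd0: PingAck did not arrive in time.
Apr 4 16:03:26 mailserv1 drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
Apr 4 16:03:26 mailserv1 drbd0: conn( NetworkFailure -> Unconnected )
""".splitlines()

if __name__ == "__main__":
    for seconds in resync_uptimes(parse_transitions(SAMPLE)):
        print(f"resync ran {seconds:.0f} s before NetworkFailure")
```

For the excerpt above this reports 788 seconds (15:50:18 to 16:03:26), a
little over 13 minutes, which is in line with the 10-20 minute interval
mentioned in the reply. Feeding it the full syslog from both nodes would
show whether the interval is regular or random.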