> -----Original Message-----
> [...]
> / 2006-11-27 09:29:23 +0100
> \ Saul, Markus:
> > > > nagios2:~# ping 192.168.1.1
> > > > PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
> > > > From 192.168.1.2 icmp_seq=1 Destination Host Unreachable
> > >
> > > > Thanks for any hints,
> > >
> > > what broken network card / driver do you use?
> > > [...]
> >
> > We already exchanged the whole bunch of plugged in network cards
> > (before the above described tests there were similar problems, so I
> > thought it would be a good idea to change the NICs), the only thing
> > that stayed the same were the on-board ports (all in all we have 3
> > interfaces, I tried the private link on different ports, always with
> > problems). The new cards are Intel Pro GT 10/100/1000.
> > The drivers used:
> > (on-board)
> > via_rhine 20900 0
> > (PCI cards)
> > e1000 94548 0
>
> these at least do very fine, usually.
> does ifconfig report any collisions/errors?
>
> this is not by chance some 64bit system
> with 32bit devices and "too much ram" ?
>
>[...]

Before and during the sync there are no errors. After the sync aborts,
errors show up on nagios2 (the "receiving" side; nagios1 stays clean)
when trying to ping nagios2:

eth3      Link encap:Ethernet  HWaddr 00:0E:0C:B8:ED:30
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:feb8:ed30/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:688427 errors:203 dropped:203 overruns:203 frame:0
          TX packets:587235 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:445101672 (424.4 MiB)  TX bytes:121924981 (116.2 MiB)
          Base address:0xd000 Memory:dbf80000-dbfa0000

This continues over time; even ARP fails (nagios1 receives the requests
and answers, but nagios2 never receives the replies on the interface).
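
To compare these counters before and after a sync without eyeballing
them, the RX/TX lines of the ifconfig output can be parsed mechanically.
A minimal Python sketch, assuming the old net-tools "key:value" output
format shown above; parse_counters and only_overruns are hypothetical
helper names, not part of any tool:

```python
import re

# RX line copied from the ifconfig output for eth3 above.
RX_LINE = "RX packets:688427 errors:203 dropped:203 overruns:203 frame:0"


def parse_counters(line):
    """Split an ifconfig 'RX ...' or 'TX ...' line into (direction, counters)."""
    direction, rest = line.split(None, 1)
    counters = {key: int(value) for key, value in re.findall(r"(\w+):(\d+)", rest)}
    return direction, counters


def only_overruns(counters):
    """True if every counted error is also counted as dropped and as an overrun."""
    return counters["errors"] > 0 and (
        counters["errors"] == counters["dropped"] == counters["overruns"]
    )


direction, rx = parse_counters(RX_LINE)
print(direction, rx, only_overruns(rx))
```

For the eth3 line above, errors, dropped and overruns are all 203 while
frame is 0, so every receive error was counted as an overrun rather than
as a frame error.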

Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth3:77961694   58441 3792 3792 3792     0          0         0 37175651   46245    0    0    0     0       0          0

This confirms the blocked state seen in the ping results.
The system log shows the following:

Nov 27 11:04:18 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate WFConnection --> WFReportParams
Nov 27 11:04:18 nagios2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 27 11:04:18 nagios2 kernel: drbd0: Connection established.
Nov 27 11:04:18 nagios2 kernel: drbd0: I am(S): 0:00000019:00000008:00000737:00000026:01
Nov 27 11:04:18 nagios2 kernel: drbd0: Peer(P): 1:00000019:00000008:00000738:00000026:10
Nov 27 11:04:18 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate WFReportParams --> WFBitMapT
Nov 27 11:04:18 nagios2 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Nov 27 11:04:18 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate WFBitMapT --> SyncTarget
Nov 27 11:04:18 nagios2 kernel: drbd0: Resync started as SyncTarget (need to sync 15272600 KB [3818150 bits set]).
Nov 27 11:04:27 nagios2 kernel: TDH <5b>
Nov 27 11:04:27 nagios2 kernel: TDT <5b>
Nov 27 11:04:27 nagios2 kernel: next_to_use <5b>
Nov 27 11:04:27 nagios2 kernel: next_to_clean <6f>
Nov 27 11:04:27 nagios2 kernel: buffer_info[next_to_clean]
Nov 27 11:04:27 nagios2 kernel: dma <aeb20ce>
Nov 27 11:04:27 nagios2 kernel: time_stamp <138d9f0a>
Nov 27 11:04:27 nagios2 kernel: next_to_watch <6f>
Nov 27 11:04:27 nagios2 kernel: jiffies <138da36d>
Nov 27 11:04:27 nagios2 kernel: next_to_watch.status <0>
Nov 27 11:04:29 nagios2 kernel: TDH <5b>
Nov 27 11:04:29 nagios2 kernel: TDT <5b>
Nov 27 11:04:29 nagios2 kernel: next_to_use <5b>
Nov 27 11:04:29 nagios2 kernel: next_to_clean <6f>
Nov 27 11:04:29 nagios2 kernel: buffer_info[next_to_clean]
Nov 27 11:04:29 nagios2 kernel: dma <aeb20ce>
Nov 27 11:04:29 nagios2 kernel: time_stamp <138d9f0a>
Nov 27 11:04:29 nagios2 kernel: next_to_watch <6f>
Nov 27 11:04:29 nagios2 kernel: jiffies <138dab3d>
Nov 27 11:04:29 nagios2 kernel: next_to_watch.status <0>
Nov 27 11:04:31 nagios2 kernel: TDH <5b>
Nov 27 11:04:31 nagios2 kernel: TDT <5b>
Nov 27 11:04:31 nagios2 kernel: next_to_use <5b>
Nov 27 11:04:31 nagios2 kernel: next_to_clean <6f>
Nov 27 11:04:31 nagios2 kernel: buffer_info[next_to_clean]
Nov 27 11:04:31 nagios2 kernel: dma <aeb20ce>
Nov 27 11:04:31 nagios2 kernel: time_stamp <138d9f0a>
Nov 27 11:04:31 nagios2 kernel: next_to_watch <6f>
Nov 27 11:04:31 nagios2 kernel: jiffies <138db30d>
Nov 27 11:04:31 nagios2 kernel: next_to_watch.status <0>
Nov 27 11:04:32 nagios2 kernel: NETDEV WATCHDOG: eth3: transmit timed out
Nov 27 11:04:34 nagios2 kernel: e1000: eth3: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
Nov 27 11:04:45 nagios2 heartbeat: [4762]: info: Link nagios1:eth3 dead.
Nov 27 11:04:45 nagios2 pingd: [5236]: notice: pingd_lstatus_callback:pingd.c Status update: Ping node nagios1 now has status [dead]
Nov 27 11:04:45 nagios2 pingd: [5236]: notice: pingd_nstatus_callback:pingd.c Status update: Ping node nagios1 now has status [dead]
Nov 27 11:04:54 nagios2 kernel: drbd0: drbd0_asender [28381]: cstate SyncTarget --> NetworkFailure
Nov 27 11:04:54 nagios2 kernel: drbd0: asender terminated
Nov 27 11:04:54 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate NetworkFailure --> BrokenPipe
Nov 27 11:04:54 nagios2 kernel: drbd0: short read receiving data block: read 2616 expected 4096
Nov 27 11:04:54 nagios2 kernel: drbd0: worker terminated
Nov 27 11:04:54 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate BrokenPipe --> Unconnected
Nov 27 11:04:54 nagios2 kernel: drbd0: Connection lost.

nagios2:~# ethtool -i eth3
driver: e1000
version: 6.1.16-k2
firmware-version: N/A
bus-info: 0000:00:0a.0

nagios2:~# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(tm) XP 2000+
stepping        : 0
cpu MHz         : 1665.772
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3335.64

nagios2:~# cat /proc/meminfo
MemTotal:       516696 kB
MemFree:        412368 kB
Buffers:          6400 kB
Cached:          39640 kB

The mainboard is an ASRock K7VT4A Pro with PCI 2.2; the network cards
seem to be PCI 2.3, but they should be backward compatible.

I once again tried to use the on-board interfaces (I don't know how
many times I switched connections, changed configs, cables and the
cards, so something may have got lost in the process) and it finished
the sync on the purely on-board line. I don't know whether this is
stable; as I said, there were always problems with something in the
setup (even with cables ...).
There are still transmit timeouts during the resync (not sure whether
this is a common issue), but at least it continued to sync.

nagios2, using the on-board interface eth0:

Nov 27 12:36:00 nagios2 kernel: drbd0: resync bitmap: bits=9648394 words=301514
Nov 27 12:36:00 nagios2 kernel: drbd0: size = 36 GB (38593576 KB)
Nov 27 12:36:00 nagios2 kernel: drbd0: 14 GB marked out-of-sync by on disk bit-map.
Nov 27 12:36:00 nagios2 kernel: drbd0: Found 4 transactions (26 active extents) in activity log.
Nov 27 12:36:00 nagios2 kernel: drbd0: drbdsetup [3965]: cstate Unconfigured --> StandAlone
Nov 27 12:36:00 nagios2 kernel: drbd0: drbdsetup [3978]: cstate StandAlone --> Unconnected
Nov 27 12:36:00 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate Unconnected --> WFConnection
Nov 27 12:36:00 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate WFConnection --> WFReportParams
Nov 27 12:36:00 nagios2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 27 12:36:00 nagios2 kernel: drbd0: Connection established.
Nov 27 12:36:00 nagios2 kernel: drbd0: I am(S): 0:00000019:00000008:0000073a:00000026:01
Nov 27 12:36:00 nagios2 kernel: drbd0: Peer(P): 1:0000001a:00000008:0000073b:00000026:10
Nov 27 12:36:00 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate WFReportParams --> WFBitMapT
Nov 27 12:36:00 nagios2 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Nov 27 12:36:01 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate WFBitMapT --> SyncTarget
Nov 27 12:36:01 nagios2 kernel: drbd0: Resync started as SyncTarget (need to sync 14952748 KB [3738187 bits set]).
Nov 27 12:36:32 nagios2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 27 12:36:32 nagios2 kernel: eth0: Transmit timed out, status 0000, PHY status 786d, resetting...
Nov 27 12:36:32 nagios2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45C1
Nov 27 12:44:10 nagios2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 27 12:44:10 nagios2 kernel: eth0: Transmit timed out, status 0000, PHY status 786d, resetting...
Nov 27 12:44:10 nagios2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45C1
Nov 27 12:58:09 nagios2 kernel: drbd0: Resync done (total 1328 sec; paused 0 sec; 11256 K/sec)
Nov 27 12:58:09 nagios2 kernel: drbd0: drbd0_worker [3966]: cstate SyncTarget --> Connected
Nov 27 13:24:19 nagios2 kernel: e1000: eth3: e1000_watchdog_task: NIC Link is Down
Nov 27 13:24:21 nagios2 kernel: e1000: eth3: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex

I guess I'll have to look for another solution: exchanging the
mainboard and/or compiling the newest e1000 driver or, as a worst case,
using the on-board interface for the DRBD link, although 1 GBit would
be nicer than 100 MBit.

Thanks for the hints and tips,
Markus
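
A side note on the e1000 register dump in the eth3 log (TDH, TDT,
time_stamp, jiffies): the numbers themselves say how long the transmit
descriptor had been pending. A minimal Python sketch, assuming the three
dumps are 2 seconds apart as the syslog timestamps suggest (they only
have one-second resolution) and ignoring jiffies wrap-around; the
function names are made up for illustration:

```python
# Values copied from the three register dumps in the eth3 log.
TIME_STAMP = 0x138D9F0A  # identical in all three dumps
DUMPS = [
    ("11:04:27", 0x138DA36D),
    ("11:04:29", 0x138DAB3D),
    ("11:04:31", 0x138DB30D),
]


def jiffies_per_second(dumps, interval_s=2):
    """Estimate the kernel tick rate from jiffies values of consecutive dumps.

    interval_s is an assumption read off the syslog timestamps.
    """
    deltas = [later[1] - earlier[1] for earlier, later in zip(dumps, dumps[1:])]
    return sum(deltas) / len(deltas) / interval_s


def pending_seconds(jiffies, time_stamp, hz):
    """Seconds between queuing the descriptor (time_stamp) and the dump."""
    return (jiffies - time_stamp) / hz


hz = jiffies_per_second(DUMPS)
for clock, jiffies in DUMPS:
    print(clock, pending_seconds(jiffies, TIME_STAMP, hz))
```

With the values from the log this gives a tick rate of 1000 jiffies per
second and a descriptor that had been pending for roughly 1.1, 3.1 and
5.1 seconds at the three dumps, which fits the transmit timeout
reported at 11:04:32.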