> -----Original Message-----
> [...]
> / 2006-11-27 09:29:23 +0100
> \ Saul, Markus:
> > > > nagios2:~# ping 192.168.1.1
> > > > PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
> > > > From 192.168.1.2 icmp_seq=1 Destination Host Unreachable
> > >
> > > > Thanks for any hints,
> > >
> > > what broken network card / driver do you use?
> > > [...]
> >
> > We already exchanged the whole bunch of plugged in network cards
> > (before the above described tests there were similar problems, so I
> > thought it would be a good idea to change the NICs), the only thing
> > that stayed the same were the on-board ports (all in all we have 3
> > interfaces, I tried the private link on different ports, always with
> > problems). The new cards are Intel Pro GT 10/100/1000.
> > The drivers used:
> > (on-board)
> > via_rhine 20900 0
> > (PCI cards)
> > e1000 94548 0
>
> these at least do very fine, usually.
> does ifconfig report any collisions/errors?
>
> this is not by chance some 64bit system
> with 32bit devices and "too much ram" ?
>
>[...]
Before and during the sync there are no errors.
After the sync aborts, errors show up on nagios2 (the "receiving" side; nagios1 stays clean) when I try to ping nagios2:
eth3      Link encap:Ethernet  HWaddr 00:0E:0C:B8:ED:30
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:feb8:ed30/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:688427 errors:203 dropped:203 overruns:203 frame:0
          TX packets:587235 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:445101672 (424.4 MiB)  TX bytes:121924981 (116.2 MiB)
          Base address:0xd000 Memory:dbf80000-dbfa0000
The errors keep accumulating over time, because even the ARP replies don't make it back: nagios1 receives the requests and answers, but nagios2 never sees the answers on the interface.
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth3:77961694   58441 3792 3792 3792     0          0         0 37175651   46245    0    0    0     0       0          0
This matches the blocked state seen in the ping results.
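As a rough cross-check of the counters above, the eth3 line from /proc/net/dev can be split into named fields and the share of RX errors computed. This is only a sketch: the helper name parse_net_dev_line is made up for illustration, and the field order is taken from the header printed in the output above.

```python
def parse_net_dev_line(line):
    """Split one /proc/net/dev data line into named RX/TX counters.

    Hypothetical helper; the field order follows the header shown above.
    """
    name, _, rest = line.partition(":")
    values = [int(v) for v in rest.split()]
    rx_keys = ["bytes", "packets", "errs", "drop", "fifo",
               "frame", "compressed", "multicast"]
    tx_keys = ["bytes", "packets", "errs", "drop", "fifo",
               "colls", "carrier", "compressed"]
    stats = {"iface": name.strip()}
    stats.update({"rx_" + k: v for k, v in zip(rx_keys, values[:8])})
    stats.update({"tx_" + k: v for k, v in zip(tx_keys, values[8:])})
    return stats


# Data line copied from the output above.
line = "eth3:77961694 58441 3792 3792 3792 0 0 0 37175651 46245 0 0 0 0 0 0"
stats = parse_net_dev_line(line)

# Share of RX errors relative to the RX packet counter.
rx_error_ratio = stats["rx_errs"] / stats["rx_packets"]
print(f"{stats['iface']}: rx errs={stats['rx_errs']} "
      f"({rx_error_ratio:.2%} of rx packets), tx errs={stats['tx_errs']}")
```

The errs, drop and fifo counters are identical here, which would be consistent with every counted error being a receive FIFO overrun, while the transmit side shows no errors at all.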
The system log shows the following:
Nov 27 11:04:18 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate WFConnection --> WFReportParams
Nov 27 11:04:18 nagios2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 27 11:04:18 nagios2 kernel: drbd0: Connection established.
Nov 27 11:04:18 nagios2 kernel: drbd0: I am(S): 0:00000019:00000008:00000737:00000026:01
Nov 27 11:04:18 nagios2 kernel: drbd0: Peer(P): 1:00000019:00000008:00000738:00000026:10
Nov 27 11:04:18 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate WFReportParams --> WFBitMapT
Nov 27 11:04:18 nagios2 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Nov 27 11:04:18 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate WFBitMapT --> SyncTarget
Nov 27 11:04:18 nagios2 kernel: drbd0: Resync started as SyncTarget (need to sync 15272600 KB [3818150 bits set]).
Nov 27 11:04:27 nagios2 kernel: TDH <5b>
Nov 27 11:04:27 nagios2 kernel: TDT <5b>
Nov 27 11:04:27 nagios2 kernel: next_to_use <5b>
Nov 27 11:04:27 nagios2 kernel: next_to_clean <6f>
Nov 27 11:04:27 nagios2 kernel: buffer_info[next_to_clean]
Nov 27 11:04:27 nagios2 kernel: dma <aeb20ce>
Nov 27 11:04:27 nagios2 kernel: time_stamp <138d9f0a>
Nov 27 11:04:27 nagios2 kernel: next_to_watch <6f>
Nov 27 11:04:27 nagios2 kernel: jiffies <138da36d>
Nov 27 11:04:27 nagios2 kernel: next_to_watch.status <0>
Nov 27 11:04:29 nagios2 kernel: TDH <5b>
Nov 27 11:04:29 nagios2 kernel: TDT <5b>
Nov 27 11:04:29 nagios2 kernel: next_to_use <5b>
Nov 27 11:04:29 nagios2 kernel: next_to_clean <6f>
Nov 27 11:04:29 nagios2 kernel: buffer_info[next_to_clean]
Nov 27 11:04:29 nagios2 kernel: dma <aeb20ce>
Nov 27 11:04:29 nagios2 kernel: time_stamp <138d9f0a>
Nov 27 11:04:29 nagios2 kernel: next_to_watch <6f>
Nov 27 11:04:29 nagios2 kernel: jiffies <138dab3d>
Nov 27 11:04:29 nagios2 kernel: next_to_watch.status <0>
Nov 27 11:04:31 nagios2 kernel: TDH <5b>
Nov 27 11:04:31 nagios2 kernel: TDT <5b>
Nov 27 11:04:31 nagios2 kernel: next_to_use <5b>
Nov 27 11:04:31 nagios2 kernel: next_to_clean <6f>
Nov 27 11:04:31 nagios2 kernel: buffer_info[next_to_clean]
Nov 27 11:04:31 nagios2 kernel: dma <aeb20ce>
Nov 27 11:04:31 nagios2 kernel: time_stamp <138d9f0a>
Nov 27 11:04:31 nagios2 kernel: next_to_watch <6f>
Nov 27 11:04:31 nagios2 kernel: jiffies <138db30d>
Nov 27 11:04:31 nagios2 kernel: next_to_watch.status <0>
Nov 27 11:04:32 nagios2 kernel: NETDEV WATCHDOG: eth3: transmit timed out
Nov 27 11:04:34 nagios2 kernel: e1000: eth3: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
Nov 27 11:04:45 nagios2 heartbeat: [4762]: info: Link nagios1:eth3 dead.
Nov 27 11:04:45 nagios2 pingd: [5236]: notice: pingd_lstatus_callback:pingd.c Status update: Ping node nagios1 now has status [dead]
Nov 27 11:04:45 nagios2 pingd: [5236]: notice: pingd_nstatus_callback:pingd.c Status update: Ping node nagios1 now has status [dead]
Nov 27 11:04:54 nagios2 kernel: drbd0: drbd0_asender [28381]: cstate SyncTarget --> NetworkFailure
Nov 27 11:04:54 nagios2 kernel: drbd0: asender terminated
Nov 27 11:04:54 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate NetworkFailure --> BrokenPipe
Nov 27 11:04:54 nagios2 kernel: drbd0: short read receiving data block: read 2616 expected 4096
Nov 27 11:04:54 nagios2 kernel: drbd0: worker terminated
Nov 27 11:04:54 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate BrokenPipe --> Unconnected
Nov 27 11:04:54 nagios2 kernel: drbd0: Connection lost.
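The three register dumps in the log above (TDH, TDT, time_stamp, jiffies) look like the e1000 driver's transmit-hang report; that reading is an assumption based on the field names, since the header line of the report is not part of the excerpt. Under that assumption, jiffies minus time_stamp shows how long the oldest pending transmit buffer has been waiting. A small sketch of the arithmetic, using only values copied from the log:

```python
# (log time, time_stamp, jiffies) copied from the three dumps in the log.
dumps = [
    ("11:04:27", 0x138d9f0a, 0x138da36d),
    ("11:04:29", 0x138d9f0a, 0x138dab3d),
    ("11:04:31", 0x138d9f0a, 0x138db30d),
]

# Age of the stuck buffer in jiffies at each dump.
ages = [jiffies - stamp for _, stamp, jiffies in dumps]

# The first two dumps are 2 seconds apart in the syslog timestamps, so the
# jiffies difference between them gives an estimate of the tick rate (HZ).
hz_estimate = (dumps[1][2] - dumps[0][2]) / 2

for (when, _, _), age in zip(dumps, ages):
    print(f"{when}: pending for {age} jiffies (~{age / hz_estimate:.1f} s)")
```

TDH equals TDT and time_stamp never changes across the dumps, so the same buffer stays pending until the NETDEV WATCHDOG fires at 11:04:32.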
nagios2:~# ethtool -i eth3
driver: e1000
version: 6.1.16-k2
firmware-version: N/A
bus-info: 0000:00:0a.0
nagios2:~# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm) XP 2000+
stepping : 0
cpu MHz : 1665.772
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 3335.64
nagios2:~# cat /proc/meminfo
MemTotal: 516696 kB
MemFree: 412368 kB
Buffers: 6400 kB
Cached: 39640 kB
The mainboard is an ASRock K7VT4A Pro with PCI 2.2. The network cards appear to be PCI 2.3, but they should be backward compatible.
I tried the on-board interfaces once more (I don't know how many times I have switched connections, configs, cables and cards, so something may have been lost along the way), and the sync finished on the purely on-board link. I don't know whether this is stable; as I said, there were always problems with something in the setup (even with cables ...). There are still timeouts during the resync. I'm not sure whether this is a common issue, but at least it continued to sync.
nagios2, using on-board interface eth0:
Nov 27 12:36:00 nagios2 kernel: drbd0: resync bitmap: bits=9648394 words=301514
Nov 27 12:36:00 nagios2 kernel: drbd0: size = 36 GB (38593576 KB)
Nov 27 12:36:00 nagios2 kernel: drbd0: 14 GB marked out-of-sync by on disk bit-map.
Nov 27 12:36:00 nagios2 kernel: drbd0: Found 4 transactions (26 active extents) in activity log.
Nov 27 12:36:00 nagios2 kernel: drbd0: drbdsetup [3965]: cstate Unconfigured --> StandAlone
Nov 27 12:36:00 nagios2 kernel: drbd0: drbdsetup [3978]: cstate StandAlone --> Unconnected
Nov 27 12:36:00 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate Unconnected --> WFConnection
Nov 27 12:36:00 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate WFConnection --> WFReportParams
Nov 27 12:36:00 nagios2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 27 12:36:00 nagios2 kernel: drbd0: Connection established.
Nov 27 12:36:00 nagios2 kernel: drbd0: I am(S): 0:00000019:00000008:0000073a:00000026:01
Nov 27 12:36:00 nagios2 kernel: drbd0: Peer(P): 1:0000001a:00000008:0000073b:00000026:10
Nov 27 12:36:00 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate WFReportParams --> WFBitMapT
Nov 27 12:36:00 nagios2 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Nov 27 12:36:01 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate WFBitMapT --> SyncTarget
Nov 27 12:36:01 nagios2 kernel: drbd0: Resync started as SyncTarget (need to sync 14952748 KB [3738187 bits set]).
Nov 27 12:36:32 nagios2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 27 12:36:32 nagios2 kernel: eth0: Transmit timed out, status 0000, PHY status 786d, resetting...
Nov 27 12:36:32 nagios2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45C1
Nov 27 12:44:10 nagios2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 27 12:44:10 nagios2 kernel: eth0: Transmit timed out, status 0000, PHY status 786d, resetting...
Nov 27 12:44:10 nagios2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45C1
Nov 27 12:58:09 nagios2 kernel: drbd0: Resync done (total 1328 sec; paused 0 sec; 11256 K/sec)
Nov 27 12:58:09 nagios2 kernel: drbd0: drbd0_worker [3966]: cstate SyncTarget --> Connected
Nov 27 13:24:19 nagios2 kernel: e1000: eth3: e1000_watchdog_task: NIC Link is Down
Nov 27 13:24:21 nagios2 kernel: e1000: eth3: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
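For a sense of how close the on-board link came to its limit, the resync figures from the log above can be turned into an average rate and compared with the raw capacity of a 100 Mbit/s link. This is a back-of-the-envelope sketch; it assumes DRBD's "KB" means 1024 bytes and ignores all protocol overhead.

```python
# Values copied from the "Resync started" and "Resync done" log lines.
need_to_sync_kb = 14952748   # KB that had to be synced
total_seconds = 1328         # total resync time

# Average resync rate in KB/s.
resync_rate_kb_s = need_to_sync_kb / total_seconds

# Raw capacity of a 100 Mbit/s link in KB/s, assuming 1 KB = 1024 bytes.
link_capacity_kb_s = 100_000_000 / 8 / 1024

utilization = resync_rate_kb_s / link_capacity_kb_s
print(f"average resync rate: {resync_rate_kb_s:.0f} KB/s")
print(f"share of raw 100 Mbit/s capacity: {utilization:.0%}")
```

The result lands within a few KB/s of the rate DRBD reports itself, so despite the two transmit timeouts the resync ran close to what a 100 Mbit/s link can carry.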
I guess I'll have to look for another solution: exchanging the mainboard and/or compiling the newest e1000 driver, or, as a worst case, using the on-board interface for the DRBD link, although 1 Gbit would be nicer than 100 Mbit.
Thanks for the hints and tips,
Markus