[DRBD-user] Problem with private link and DRBD sync process

Saul, Markus Markus.Saul at danet.de
Mon Nov 27 13:52:04 CET 2006



> -----Original Message-----
> [...]
> / 2006-11-27 09:29:23 +0100
> \ Saul, Markus:
> > > > nagios2:~# ping 192.168.1.1
> > > > PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
> > > > From 192.168.1.2 icmp_seq=1 Destination Host Unreachable
> > > 
> > > > Thanks for any hints,
> > > 
> > > what broken network card / driver do you use?
> > > [...]
> > 
> > We already exchanged the whole bunch of plugged in network cards
> > (before the above described tests there were similar problems, so I
> > thought it would be a good idea to change the NICs), the only thing
> > that stayed the same were the on-board ports (all in all we have 3
> > interfaces, I tried the private link on different ports, always with
> > problems). The new cards are Intel Pro GT 10/100/1000. 
> > The drivers used:
> > (on-board)
> > via_rhine              20900  0
> > (PCI cards)
> > e1000                  94548  0
> 
> these at least do very fine, usually.
> does ifconfig report any collisions/errors?
> 
> this is not by chance some 64bit system
> with 32bit devices and "too much ram" ?
> 
>[...] 

Before and during the sync there are no errors.
After the sync aborts, errors show up on nagios2 (the receiving side; nagios1 stays clean) when I try to ping nagios2:

eth3      Link encap:Ethernet  HWaddr 00:0E:0C:B8:ED:30
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:feb8:ed30/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:688427 errors:203 dropped:203 overruns:203 frame:0
          TX packets:587235 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:445101672 (424.4 MiB)  TX bytes:121924981 (116.2 MiB)
          Base address:0xd000 Memory:dbf80000-dbfa0000

This continues over time: even the ARP replies don't come back (nagios1 receives the requests and answers them, but nagios2 never sees the replies on its interface).

 Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth3:77961694   58441 3792 3792 3792     0          0         0 37175651   46245    0    0    0     0       0          0
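
The counters above can also be read programmatically, which makes it easier to compare them before and after a sync attempt. The following is only a minimal sketch in Python; the function name parse_proc_net_dev and the SAMPLE constant are made up for illustration, and the sample text is simply the /proc/net/dev output pasted above.

```python
# Minimal sketch: parse /proc/net/dev style text into per-interface counters.
# SAMPLE is the output quoted in this mail; on a live system one would read
# the text from /proc/net/dev instead (assumption: same two-line header).
SAMPLE = """\
 Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth3:77961694   58441 3792 3792 3792     0          0         0 37175651   46245    0    0    0     0       0          0
"""


def parse_proc_net_dev(text):
    """Return {interface: {"rx": {...}, "tx": {...}}} from /proc/net/dev text."""
    lines = text.strip().splitlines()
    # The second header line names the columns: "face |rx fields|tx fields".
    _, rx_header, tx_header = lines[1].split("|")
    rx_names, tx_names = rx_header.split(), tx_header.split()
    stats = {}
    for line in lines[2:]:
        name, _, values = line.partition(":")
        numbers = [int(v) for v in values.split()]
        stats[name.strip()] = {
            "rx": dict(zip(rx_names, numbers[:len(rx_names)])),
            "tx": dict(zip(tx_names, numbers[len(rx_names):])),
        }
    return stats


stats = parse_proc_net_dev(SAMPLE)
eth3 = stats["eth3"]
print("eth3 rx errs/drop/fifo:", eth3["rx"]["errs"], eth3["rx"]["drop"], eth3["rx"]["fifo"])
print("eth3 tx errs:", eth3["tx"]["errs"])
```

Running the parser twice, once before and once after a resync attempt, and subtracting the values would show whether only the receive side accumulates errors, as it does here.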

The receive error counters confirm the blocked state seen in the ping results.
The system log shows the following:

Nov 27 11:04:18 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate WFConnection --> WFReportParams
Nov 27 11:04:18 nagios2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 27 11:04:18 nagios2 kernel: drbd0: Connection established.
Nov 27 11:04:18 nagios2 kernel: drbd0: I am(S): 0:00000019:00000008:00000737:00000026:01
Nov 27 11:04:18 nagios2 kernel: drbd0: Peer(P): 1:00000019:00000008:00000738:00000026:10
Nov 27 11:04:18 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate WFReportParams --> WFBitMapT
Nov 27 11:04:18 nagios2 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Nov 27 11:04:18 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate WFBitMapT --> SyncTarget
Nov 27 11:04:18 nagios2 kernel: drbd0: Resync started as SyncTarget (need to sync 15272600 KB [3818150 bits set]).
Nov 27 11:04:27 nagios2 kernel:   TDH                  <5b>
Nov 27 11:04:27 nagios2 kernel:   TDT                  <5b>
Nov 27 11:04:27 nagios2 kernel:   next_to_use          <5b>
Nov 27 11:04:27 nagios2 kernel:   next_to_clean        <6f>
Nov 27 11:04:27 nagios2 kernel: buffer_info[next_to_clean]
Nov 27 11:04:27 nagios2 kernel:   dma                  <aeb20ce>
Nov 27 11:04:27 nagios2 kernel:   time_stamp           <138d9f0a>
Nov 27 11:04:27 nagios2 kernel:   next_to_watch        <6f>
Nov 27 11:04:27 nagios2 kernel:   jiffies              <138da36d>
Nov 27 11:04:27 nagios2 kernel:   next_to_watch.status <0>
Nov 27 11:04:29 nagios2 kernel:   TDH                  <5b>
Nov 27 11:04:29 nagios2 kernel:   TDT                  <5b>
Nov 27 11:04:29 nagios2 kernel:   next_to_use          <5b>
Nov 27 11:04:29 nagios2 kernel:   next_to_clean        <6f>
Nov 27 11:04:29 nagios2 kernel: buffer_info[next_to_clean]
Nov 27 11:04:29 nagios2 kernel:   dma                  <aeb20ce>
Nov 27 11:04:29 nagios2 kernel:   time_stamp           <138d9f0a>
Nov 27 11:04:29 nagios2 kernel:   next_to_watch        <6f>
Nov 27 11:04:29 nagios2 kernel:   jiffies              <138dab3d>
Nov 27 11:04:29 nagios2 kernel:   next_to_watch.status <0>
Nov 27 11:04:31 nagios2 kernel:   TDH                  <5b>
Nov 27 11:04:31 nagios2 kernel:   TDT                  <5b>
Nov 27 11:04:31 nagios2 kernel:   next_to_use          <5b>
Nov 27 11:04:31 nagios2 kernel:   next_to_clean        <6f>
Nov 27 11:04:31 nagios2 kernel: buffer_info[next_to_clean]
Nov 27 11:04:31 nagios2 kernel:   dma                  <aeb20ce>
Nov 27 11:04:31 nagios2 kernel:   time_stamp           <138d9f0a>
Nov 27 11:04:31 nagios2 kernel:   next_to_watch        <6f>
Nov 27 11:04:31 nagios2 kernel:   jiffies              <138db30d>
Nov 27 11:04:31 nagios2 kernel:   next_to_watch.status <0>
Nov 27 11:04:32 nagios2 kernel: NETDEV WATCHDOG: eth3: transmit timed out
Nov 27 11:04:34 nagios2 kernel: e1000: eth3: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
Nov 27 11:04:45 nagios2 heartbeat: [4762]: info: Link nagios1:eth3 dead.
Nov 27 11:04:45 nagios2 pingd: [5236]: notice: pingd_lstatus_callback:pingd.c Status update: Ping node nagios1 now has status [dead]
Nov 27 11:04:45 nagios2 pingd: [5236]: notice: pingd_nstatus_callback:pingd.c Status update: Ping node nagios1 now has status [dead]
Nov 27 11:04:54 nagios2 kernel: drbd0: drbd0_asender [28381]: cstate SyncTarget --> NetworkFailure
Nov 27 11:04:54 nagios2 kernel: drbd0: asender terminated
Nov 27 11:04:54 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate NetworkFailure --> BrokenPipe
Nov 27 11:04:54 nagios2 kernel: drbd0: short read receiving data block: read 2616 expected 4096
Nov 27 11:04:54 nagios2 kernel: drbd0: worker terminated
Nov 27 11:04:54 nagios2 kernel: drbd0: drbd0_receiver [28359]: cstate BrokenPipe --> Unconnected
Nov 27 11:04:54 nagios2 kernel: drbd0: Connection lost.
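
The three e1000 dumps in the log above can be compared numerically. In all of them time_stamp stays at 138d9f0a while jiffies advances, so the difference is how long the buffer at next_to_clean has been waiting. The sketch below is only an illustration (the names DUMPS, stall_jiffies and estimate_hz are invented here); it estimates HZ from the syslog timestamps, which have one-second resolution, so the result is approximate.

```python
# Minimal sketch: how long has the transmit buffer been stuck?
# Each entry is (syslog time, time_stamp, jiffies), hex values as logged.
DUMPS = [
    ("11:04:27", 0x138D9F0A, 0x138DA36D),
    ("11:04:29", 0x138D9F0A, 0x138DAB3D),
    ("11:04:31", 0x138D9F0A, 0x138DB30D),
]


def to_seconds(hms):
    """Convert an HH:MM:SS string into seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def stall_jiffies(dumps):
    """Jiffies elapsed between queuing the buffer and each dump."""
    return [jiffies - stamp for _, stamp, jiffies in dumps]


def estimate_hz(dumps):
    """Estimate the tick rate from the first and last dump (approximate)."""
    first_time, _, first_jiffies = dumps[0]
    last_time, _, last_jiffies = dumps[-1]
    return (last_jiffies - first_jiffies) / (to_seconds(last_time) - to_seconds(first_time))


hz = estimate_hz(DUMPS)
for (when, _, _), stalled in zip(DUMPS, stall_jiffies(DUMPS)):
    print(f"{when}: stalled for {stalled} jiffies, about {stalled / hz:.1f} s at HZ~{hz:.0f}")
```

Under these assumptions the same buffer has been pending for roughly one, three and five seconds in the successive dumps, which fits the "transmit timed out" message from the NETDEV WATCHDOG at 11:04:32.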

nagios2:~# ethtool -i eth3
driver: e1000
version: 6.1.16-k2
firmware-version: N/A
bus-info: 0000:00:0a.0

nagios2:~# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(tm) XP 2000+
stepping        : 0
cpu MHz         : 1665.772
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3335.64

nagios2:~# cat /proc/meminfo
MemTotal:       516696 kB
MemFree:        412368 kB
Buffers:          6400 kB
Cached:          39640 kB


The mainboard is an ASRock K7VT4A Pro with PCI 2.2. The network cards appear to be PCI 2.3, but they should be backward compatible.


I tried the on-board interfaces once again (I don't know how many times I have switched connections and changed configs, cables and cards, so something may have been lost along the way), and the sync finished on the purely on-board link. I don't know whether this is stable; as I said, there were always problems with something in the setup (even with cables...). There are still transmit timeouts during the resync. I'm not sure whether this is a common issue, but at least the sync continued.

nagios2, using on-board interface eth0:

Nov 27 12:36:00 nagios2 kernel: drbd0: resync bitmap: bits=9648394 words=301514
Nov 27 12:36:00 nagios2 kernel: drbd0: size = 36 GB (38593576 KB)
Nov 27 12:36:00 nagios2 kernel: drbd0: 14 GB marked out-of-sync by on disk bit-map.
Nov 27 12:36:00 nagios2 kernel: drbd0: Found 4 transactions (26 active extents) in activity log.
Nov 27 12:36:00 nagios2 kernel: drbd0: drbdsetup [3965]: cstate Unconfigured --> StandAlone
Nov 27 12:36:00 nagios2 kernel: drbd0: drbdsetup [3978]: cstate StandAlone --> Unconnected
Nov 27 12:36:00 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate Unconnected --> WFConnection
Nov 27 12:36:00 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate WFConnection --> WFReportParams
Nov 27 12:36:00 nagios2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 27 12:36:00 nagios2 kernel: drbd0: Connection established.
Nov 27 12:36:00 nagios2 kernel: drbd0: I am(S): 0:00000019:00000008:0000073a:00000026:01
Nov 27 12:36:00 nagios2 kernel: drbd0: Peer(P): 1:0000001a:00000008:0000073b:00000026:10
Nov 27 12:36:00 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate WFReportParams --> WFBitMapT
Nov 27 12:36:00 nagios2 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Nov 27 12:36:01 nagios2 kernel: drbd0: drbd0_receiver [3979]: cstate WFBitMapT --> SyncTarget
Nov 27 12:36:01 nagios2 kernel: drbd0: Resync started as SyncTarget (need to sync 14952748 KB [3738187 bits set]).
Nov 27 12:36:32 nagios2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 27 12:36:32 nagios2 kernel: eth0: Transmit timed out, status 0000, PHY status 786d, resetting...
Nov 27 12:36:32 nagios2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45C1
Nov 27 12:44:10 nagios2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 27 12:44:10 nagios2 kernel: eth0: Transmit timed out, status 0000, PHY status 786d, resetting...
Nov 27 12:44:10 nagios2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45C1
Nov 27 12:58:09 nagios2 kernel: drbd0: Resync done (total 1328 sec; paused 0 sec; 11256 K/sec)
Nov 27 12:58:09 nagios2 kernel: drbd0: drbd0_worker [3966]: cstate SyncTarget --> Connected
Nov 27 13:24:19 nagios2 kernel: e1000: eth3: e1000_watchdog_task: NIC Link is Down
Nov 27 13:24:21 nagios2 kernel: e1000: eth3: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
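
The resync figures in this log can be cross-checked with a little arithmetic. Both "need to sync" lines give exactly 4 KB per bitmap bit (15272600 KB for 3818150 bits, 14952748 KB for 3738187 bits). The sketch below is only a rough check; the helper names are invented, and treating 1 KB as 1024 bytes is an assumption of mine, not something the log states.

```python
# Minimal sketch: cross-check the resync numbers reported by DRBD above.
KB_PER_BIT = 4          # derived from the log: 15272600 KB / 3818150 bits
BYTES_PER_KB = 1024     # assumption: DRBD's "KB" means 1024 bytes


def bits_to_kb(bits):
    """Amount of data covered by the given number of out-of-sync bits."""
    return bits * KB_PER_BIT


def rate_kb_per_s(kilobytes, seconds):
    """Average throughput for a resync of the given size and duration."""
    return kilobytes / seconds


def kb_per_s_to_mbit_per_s(kb_per_s):
    """Convert KB/s into Mbit/s (10^6 bits per second)."""
    return kb_per_s * BYTES_PER_KB * 8 / 1e6


to_sync_kb = bits_to_kb(3738187)           # on-board run over eth0
rate = rate_kb_per_s(to_sync_kb, 1328)     # "total 1328 sec" from the log
print(f"to sync: {to_sync_kb} KB")
print(f"average rate: {rate:.0f} KB/s, about {kb_per_s_to_mbit_per_s(rate):.0f} Mbit/s")
```

The computed average is within a fraction of a percent of the 11256 K/sec that DRBD reports, and under the 1024-byte assumption it corresponds to a bit over 90 Mbit/s, so the 100 Mbit on-board link was close to saturated during the resync.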


I guess I'll have to look for another solution: exchanging the mainboard and/or compiling the newest e1000 driver or, as a worst case, using the on-board interface for the DRBD link, although 1 Gbit would be nicer than 100 Mbit.

Thanks for the hints and tips,

 Markus





