Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Lars Ellenberg wrote:

>>>/ 2004-10-07 13:46:15 +0200
>>>\ Alex Ongena:
>>>
>>>>Hi,
>>>>
>>>>My master stays in WFReportParams forever due to a network
>>>>failure on my slave.
>>>>
>>>>Scenario: Master is running and is Primary, Slave is booting
>>>>
>>>>This is the relevant log:
>>>
>>>interesting.
>>>I miss the "handshake successful" message, though.
>>>anyways, this "should not happen".
>>>
>>>we'll have a look.
>>>
>>>what kernel is this, in case it matters?
>>
>>There is a new release 0.7.5 ... perhaps it fixes this?
>
> unlikely.
> those have been unrelated fixes, I think.

I currently have a fileserver stuck with the same problem (I think),
running 0.7.5 on 2.4.27.

The cluster is a pair of identical machines:

  # uname -a
  Linux 2.4.27 #14 SMP Tue Oct 12 16:31:10 BST 2004 i686 unknown unknown GNU/Linux

  vendor_id   : GenuineIntel
  cpu family  : 15
  model       : 2
  model name  : Intel(R) Xeon(TM) CPU 2.80GHz
  stepping    : 5
  cpu MHz     : 2793.076
  cache size  : 512 KB

  MemTotal: 2068944 kB

  Intel(R) PRO/1000 Network Driver - version 5.4.11
  e1000: eth0, eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
  (using the latest e1000-5.4.11 from Intel)

  Intel(R) PRO/100 Network Driver - version 2.3.43-k1
  e100: eth2: Intel(R) PRO/100 Network Connection

  ICH5: chipset revision 2

3ware 9500s-8 SCSI-style IDE RAID controller:

  3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfe8ffc00,
  3w-9xxx: scsi0: Firmware FE9X 2.02.00.012, BIOS BE9X 2.02.01.037, Ports: 8.

7 x 250G disks + 1 hot spare, RAID 5, so ~1.4T of logical disk space
per node.

  drbd: initialised. Version: 0.7.5 (api:76/proto:74)
  drbd: SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22

eth0 and eth2 are bonded as bond0 and access the main LAN; eth1,
however, is dedicated to DRBD as a gigabit crossover segment direct to
the other node, on a 10.0.0.0/24 network.

The nodes have been running using protocol C, brought up by:

  drbdsetup /dev/drbd0 disk /dev/sda3 internal -1
  drbdsetup /dev/drbd0 primary
  drbdsetup /dev/drbd0 net 10.0.0.2:7788 10.0.0.1:7788 C
  drbdsetup /dev/drbd0 syncer -r 512000

The shared device is a single 1.4T XFS partition.
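(For reference, I believe the equivalent declarative setup in a
0.7-style drbd.conf would look roughly like the below. The node names
are placeholders for our real hostnames, and I may have minor syntax
details wrong, since we drive it with raw drbdsetup rather than
drbdadm:)

    resource r0 {
      protocol C;
      incon-degr-cmd "halt -f";   # value from the 0.7 sample config

      on node-a {                 # placeholder; drbdadm matches uname -n
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.0.0.2:7788;  # local address from our net command
        meta-disk internal;
      }

      on node-b {                 # placeholder hostname
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.0.0.1:7788;  # peer address from our net command
        meta-disk internal;
      }

      syncer {
        rate 500M;                # == "syncer -r 512000" (KB/sec)
      }
    }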
The e1000 driver has been very flaky, with:

  NETDEV WATCHDOG: eth1: transmit timed out
  drbd0: PingAck did not arrive in time.
  drbd0: drbd0_asender [164]: cstate Connected --> NetworkFailure
  drbd0: asender terminated
  drbd0: drbd0_receiver [163]: cstate NetworkFailure --> BrokenPipe
  drbd0: short read expecting header on sock: r=-512
  drbd0: worker terminated
  drbd0: drbd0_receiver [163]: cstate BrokenPipe --> Unconnected
  drbd0: Connection lost.

appearing quite often on the master, associated with:

  e1000: eth1: e1000_watchdog: NIC Link is Down
  e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
  drbd0: meta connection shut down by peer.
  drbd0: drbd0_asender [181]: cstate Connected --> NetworkFailure
  drbd0: asender terminated
  drbd0: drbd0_receiver [180]: cstate NetworkFailure --> BrokenPipe
  drbd0: short read receiving data block: read 1632 expected 4096
  drbd0: error receiving Data, l: 4112!
  drbd0: worker terminated
  drbd0: drbd0_receiver [180]: cstate BrokenPipe --> Unconnected
  drbd0: Connection lost.

appearing on the slave.

Most recently, however, this happened with the master reporting:

  NETDEV WATCHDOG: eth1: transmit timed out
  drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
  e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
  drbd0: PingAck did not arrive in time.
  drbd0: drbd0_asender [329]: cstate Connected --> NetworkFailure
  drbd0: asender terminated
  drbd0: drbd0_receiver [328]: cstate NetworkFailure --> BrokenPipe
  drbd0: short read expecting header on sock: r=-512
  drbd0: short sent UnplugRemote size=8 sent=-1001
  drbd0: worker terminated
  drbd0: drbd0_receiver [328]: cstate BrokenPipe --> Unconnected
  drbd0: Connection lost.
  drbd0: drbd0_receiver [328]: cstate Unconnected --> WFConnection
  drbd0: drbd0_receiver [328]: cstate WFConnection --> WFReportParams

DRBD then hangs hard in WFReportParams. The underlying device is still
accessible, but the drbdsetup userland utils hang solid, and of course
replication is dead.

Meanwhile, on the slave:

  e1000: eth1: e1000_watchdog: NIC Link is Down
  e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
  drbd0: meta connection shut down by peer.
  drbd0: drbd0_asender [243]: cstate Connected --> NetworkFailure
  drbd0: asender terminated
  drbd0: drbd0_receiver [236]: cstate NetworkFailure --> BrokenPipe
  drbd0: short read receiving data block: read 924 expected 4096
  drbd0: error receiving Data, l: 4112!
  drbd0: worker terminated
  drbd0: drbd0_receiver [236]: cstate BrokenPipe --> Unconnected
  drbd0: Connection lost.
  drbd0: drbd0_receiver [236]: cstate Unconnected --> WFConnection
  drbd0: drbd0_receiver [236]: cstate WFConnection --> WFReportParams
  drbd0: sock_recvmsg returned -11
  drbd0: drbd0_receiver [236]: cstate WFReportParams --> BrokenPipe
  drbd0: short read expecting header on sock: r=-11
  drbd0: Discarding network configuration.

The slave's DRBD comes back up okay, but without the master, and out of
sync, it's obviously useless.

I assume my only option is to reboot the master to get out of this
mess; if there is any way to stop DRBD from sometimes doing this on
network failure, it would be very much appreciated ;)

Also, any thoughts on what might cause such horrible network/DRBD
flakiness in the first place would be very gratefully received - are
there known clashes between the e100 & e1000 drivers? Or with bonding?
Or with the 3ware RAID card, or even with using XFS?

thanks in advance,

Matthew.

-- 
______________________________________________________________
Matthew Hodgson                      matthew at mxtelecom.com
Systems Analyst, MX Telecom Ltd.         Tel: +44 845 6667778
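P.S. Until there's a real fix, I'm tempted to tighten the net timeouts
(if I'm reading the 0.7 options right, drbdsetup's net command takes
timeout (-t) and ko-count (-k) arguments) so the master gives up on a
dead peer rather than blocking, and to run something like the sketch
below to at least get alerted when drbd0 wedges in WFReportParams. The
30s poll and 300s threshold are arbitrary numbers of mine, and it only
detects the hang - it can't clear it, since the drbdsetup utils
themselves hang solid:

    #!/bin/sh
    # Poll /proc/drbd and shout if drbd0 sits in WFReportParams too long.
    # Thresholds are arbitrary; "logger" could be swapped for mail, etc.
    POLL=30        # seconds between checks
    LIMIT=300      # seconds stuck before we complain
    stuck=0
    while sleep $POLL; do
        if grep -q 'cs:WFReportParams' /proc/drbd; then
            stuck=$((stuck + POLL))
            if [ $stuck -ge $LIMIT ]; then
                logger -t drbd-watch "drbd0 stuck in WFReportParams for ${stuck}s"
                stuck=0    # rearm so we complain periodically, not constantly
            fi
        else
            stuck=0
        fi
    done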