Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Lars Ellenberg wrote:

>>>/ 2004-10-07 13:46:15 +0200
>>>\ Alex Ongena:
>>>
>>>>Hi,
>>>>
>>>>My master stays in WFReportParams forever due to a network
>>>>failure on my slave.
>>>>
>>>>Scenario: Master is running and is Primary, Slave is booting
>>>>
>>>>This is the relevant log:
>>>
>>>interesting.
>>>I miss the "handshake successful" message, though.
>>>anyways, this "should not happen".
>>>
>>>we'll have a look.
>>>
>>>what kernel is this, in case it matters?
>>
>>There is a new release 0.7.5 ... perhaps it fixes this?
>
> unlikely.
> those have been unrelated fixes, I think.

I currently have a fileserver stuck with the same problem (I think),
running 0.7.5 on 2.4.27.

The cluster is a pair of identical machines:

  # uname -a
  Linux 2.4.27 #14 SMP Tue Oct 12 16:31:10 BST 2004 i686 unknown unknown GNU/Linux

  vendor_id   : GenuineIntel
  cpu family  : 15
  model       : 2
  model name  : Intel(R) Xeon(TM) CPU 2.80GHz
  stepping    : 5
  cpu MHz     : 2793.076
  cache size  : 512 KB

  MemTotal: 2068944 kB

  Intel(R) PRO/1000 Network Driver - version 5.4.11
  e1000: eth0, eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
  (using the latest e1000-5.4.11 from Intel)

  Intel(R) PRO/100 Network Driver - version 2.3.43-k1
  e100: eth2: Intel(R) PRO/100 Network Connection

  ICH5: chipset revision 2

3ware 9500s-8 SCSI-style IDE RAID controller:

  3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfe8ffc00,
  3w-9xxx: scsi0: Firmware FE9X 2.02.00.012, BIOS BE9X 2.02.01.037, Ports: 8.

7 x 250G disks + 1 hot spare, RAID 5, so ~1.4T of logical disk space
per node.

  drbd: initialised. Version: 0.7.5 (api:76/proto:74)
  drbd: SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22

eth0 and eth2 are bonded as bond0 and access the main LAN; eth1,
however, is dedicated to DRBD as a gigabit crossover segment direct to
the other node, on a 10.0.0.0/24 network.

The nodes have been running using protocol C, brought up by:

  drbdsetup /dev/drbd0 disk /dev/sda3 internal -1
  drbdsetup /dev/drbd0 primary
  drbdsetup /dev/drbd0 net 10.0.0.2:7788 10.0.0.1:7788 C
  drbdsetup /dev/drbd0 syncer -r 512000

The shared device is a single 1.4T XFS partition.
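(For reference, I believe the equivalent declarative setup in a
0.7-style drbd.conf would look roughly like the below. The node names
are placeholders for our real hostnames, and I may have minor syntax
details wrong, since we drive it with raw drbdsetup rather than
drbdadm:)

    resource r0 {
      protocol C;
      incon-degr-cmd "halt -f";   # value from the 0.7 sample config

      on node-a {                 # placeholder; drbdadm matches uname -n
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.0.0.2:7788;  # local address from our net command
        meta-disk internal;
      }

      on node-b {                 # placeholder hostname
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.0.0.1:7788;  # peer address from our net command
        meta-disk internal;
      }

      syncer {
        rate 500M;                # == "syncer -r 512000" (KB/sec)
      }
    }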
The e1000 driver has been very flaky, with:

  NETDEV WATCHDOG: eth1: transmit timed out
  drbd0: PingAck did not arrive in time.
  drbd0: drbd0_asender [164]: cstate Connected --> NetworkFailure
  drbd0: asender terminated
  drbd0: drbd0_receiver [163]: cstate NetworkFailure --> BrokenPipe
  drbd0: short read expecting header on sock: r=-512
  drbd0: worker terminated
  drbd0: drbd0_receiver [163]: cstate BrokenPipe --> Unconnected
  drbd0: Connection lost.

appearing quite often on the master, associated with:

  e1000: eth1: e1000_watchdog: NIC Link is Down
  e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
  drbd0: meta connection shut down by peer.
  drbd0: drbd0_asender [181]: cstate Connected --> NetworkFailure
  drbd0: asender terminated
  drbd0: drbd0_receiver [180]: cstate NetworkFailure --> BrokenPipe
  drbd0: short read receiving data block: read 1632 expected 4096
  drbd0: error receiving Data, l: 4112!
  drbd0: worker terminated
  drbd0: drbd0_receiver [180]: cstate BrokenPipe --> Unconnected
  drbd0: Connection lost.

appearing on the slave.

Most recently, however, this happened with the master reporting:

  NETDEV WATCHDOG: eth1: transmit timed out
  drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
  e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
  drbd0: PingAck did not arrive in time.
  drbd0: drbd0_asender [329]: cstate Connected --> NetworkFailure
  drbd0: asender terminated
  drbd0: drbd0_receiver [328]: cstate NetworkFailure --> BrokenPipe
  drbd0: short read expecting header on sock: r=-512
  drbd0: short sent UnplugRemote size=8 sent=-1001
  drbd0: worker terminated
  drbd0: drbd0_receiver [328]: cstate BrokenPipe --> Unconnected
  drbd0: Connection lost.
  drbd0: drbd0_receiver [328]: cstate Unconnected --> WFConnection
  drbd0: drbd0_receiver [328]: cstate WFConnection --> WFReportParams

DRBD then hangs hard in WFReportParams. The underlying device is still
accessible, but the drbdsetup userland utils hang solid, and of course
replication is dead.

Meanwhile, on the slave:

  e1000: eth1: e1000_watchdog: NIC Link is Down
  e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
  drbd0: meta connection shut down by peer.
  drbd0: drbd0_asender [243]: cstate Connected --> NetworkFailure
  drbd0: asender terminated
  drbd0: drbd0_receiver [236]: cstate NetworkFailure --> BrokenPipe
  drbd0: short read receiving data block: read 924 expected 4096
  drbd0: error receiving Data, l: 4112!
  drbd0: worker terminated
  drbd0: drbd0_receiver [236]: cstate BrokenPipe --> Unconnected
  drbd0: Connection lost.
  drbd0: drbd0_receiver [236]: cstate Unconnected --> WFConnection
  drbd0: drbd0_receiver [236]: cstate WFConnection --> WFReportParams
  drbd0: sock_recvmsg returned -11
  drbd0: drbd0_receiver [236]: cstate WFReportParams --> BrokenPipe
  drbd0: short read expecting header on sock: r=-11
  drbd0: Discarding network configuration.

The slave's DRBD comes back up okay, but without the master, and out of
sync, it's obviously useless.

I assume my only option is to reboot the master to get out of this
mess; if there is any way to stop DRBD from sometimes doing this on
network failure, it would be very much appreciated ;)

Also, any thoughts on what might cause such horrible network/DRBD
flakiness in the first place would be very gratefully received - are
there known clashes between the e100 & e1000 drivers? Or with bonding?
Or with the 3ware RAID card, or even with using XFS?

thanks in advance,

Matthew.

-- 
______________________________________________________________
Matthew Hodgson                      matthew at mxtelecom.com
Systems Analyst, MX Telecom Ltd.         Tel: +44 845 6667778
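P.S. Until there's a real fix, I'm tempted to tighten the net timeouts
(if I'm reading the 0.7 options right, drbdsetup's net command takes
timeout (-t) and ko-count (-k) arguments) so the master gives up on a
dead peer rather than blocking, and to run something like the sketch
below to at least get alerted when drbd0 wedges in WFReportParams. The
30s poll and 300s threshold are arbitrary numbers of mine, and it only
detects the hang - it can't clear it, since the drbdsetup utils
themselves hang solid:

    #!/bin/sh
    # Poll /proc/drbd and shout if drbd0 sits in WFReportParams too long.
    # Thresholds are arbitrary; "logger" could be swapped for mail, etc.
    POLL=30        # seconds between checks
    LIMIT=300      # seconds stuck before we complain
    stuck=0
    while sleep $POLL; do
        if grep -q 'cs:WFReportParams' /proc/drbd; then
            stuck=$((stuck + POLL))
            if [ $stuck -ge $LIMIT ]; then
                logger -t drbd-watch "drbd0 stuck in WFReportParams for ${stuck}s"
                stuck=0    # rearm so we complain periodically, not constantly
            fi
        else
            stuck=0
        fi
    done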