/ 2004-10-21 20:03:58 +0100
\ Matthew Hodgson:
> Lars Ellenberg wrote:
>
> >>>/ 2004-10-07 13:46:15 +0200
> >>>\ Alex Ongena:
> >>>
> >>>>Hi,
> >>>>
> >>>>My master stays in WFReportParams forever due to a network
> >>>>failure on my slave.
> >>>>
> >>>>Scenario: Master is running and is Primary, Slave is booting
> >>>>
> >>>>This is the relevant log:
> >>>
> >>>interesting.
> >>>I miss the "handshake successful" message, though.
> >>>anyways, this "should not happen".
> >>>
> >>>we'll have a look.
> >>>
> >>>what kernel is this, in case it matters?
> >>
> >>There is a new release 0.7.5 ... perhaps it fixes this?
> >
> >unlikely.
> >those have been unrelated fixes, I think.
>
> I currently have a fileserver stuck with the same problem (I think)
> running 0.7.5 on 2.4.27. The cluster is a pair of identical:
>
> # uname -a
> Linux 2.4.27 #14 SMP Tue Oct 12 16:31:10 BST 2004 i686 unknown unknown
> GNU/Linux
>
> vendor_id : GenuineIntel
> cpu family : 15
> model : 2
> model name : Intel(R) Xeon(TM) CPU 2.80GHz
> stepping : 5
> cpu MHz : 2793.076
> cache size : 512 KB
>
> MemTotal: 2068944 kB
>
> Intel(R) PRO/1000 Network Driver - version 5.4.11
> e1000: eth0, eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
> (using latest e1000-5.4.11 from Intel)
>
> Intel(R) PRO/100 Network Driver - version 2.3.43-k1
> e100: eth2: Intel(R) PRO/100 Network Connection
>
> ICH5: chipset revision 2
>
> 3ware 9500s-8 SCSI-style IDE RAID Controller:
> 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfe8ffc00,
> 3w-9xxx: scsi0: Firmware FE9X 2.02.00.012, BIOS BE9X 2.02.01.037, Ports: 8.
>
> 7x250G + 1 hot spare disks, RAID 5, so ~1.4T logical disk space per node.
>
> drbd: initialised. Version: 0.7.5 (api:76/proto:74)
> drbd: SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22
>
> eth0 and eth2 are bonded as bond0 and access the main LAN - eth1 however
> is dedicated to drbd as a gigabit crossover segment direct to the other
> node, on a 10.0.0.0/24 network.
>
> The nodes have been running using protocol C brought up by:
>
> drbdsetup /dev/drbd0 disk /dev/sda3 internal -1
> drbdsetup /dev/drbd0 primary
> drbdsetup /dev/drbd0 net 10.0.0.2:7788 10.0.0.1:7788 C
> drbdsetup /dev/drbd0 syncer -r 512000
>
> The shared device is a single 1.4T XFS partition.
>
>
> The e1000 driver has been very flakey, with:
>
> NETDEV WATCHDOG: eth1: transmit timed out
> drbd0: PingAck did not arrive in time.
> drbd0: drbd0_asender [164]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [163]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read expecting header on sock: r=-512
> drbd0: worker terminated
> drbd0: drbd0_receiver [163]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
>
> appearing quite often on the master, associated with:
>
> e1000: eth1: e1000_watchdog: NIC Link is Down
> e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
> drbd0: meta connection shut down by peer.
> drbd0: drbd0_asender [181]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [180]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read receiving data block: read 1632 expected 4096
> drbd0: error receiving Data, l: 4112!
> drbd0: worker terminated
> drbd0: drbd0_receiver [180]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
>
> appearing on the slave.
>
>
> Most recently however, this happened with the master reporting:
>
> NETDEV WATCHDOG: eth1: transmit timed out
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
> e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
> drbd0: PingAck did not arrive in time.
> drbd0: drbd0_asender [329]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [328]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read expecting header on sock: r=-512
> drbd0: short sent UnplugRemote size=8 sent=-1001
> drbd0: worker terminated
> drbd0: drbd0_receiver [328]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> drbd0: drbd0_receiver [328]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [328]: cstate WFConnection --> WFReportParams
>
> DRBD then hangs hard in WFReportParams mode. The underlying device is
> still accessible, but the drbdsetup userland utils hang solid, and of
> course replication is dead.
>
> Meanwhile on the slave:
>
> e1000: eth1: e1000_watchdog: NIC Link is Down
> e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
> drbd0: meta connection shut down by peer.
> drbd0: drbd0_asender [243]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [236]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read receiving data block: read 924 expected 4096
> drbd0: error receiving Data, l: 4112!
> drbd0: worker terminated
> drbd0: drbd0_receiver [236]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> drbd0: drbd0_receiver [236]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [236]: cstate WFConnection --> WFReportParams
> drbd0: sock_recvmsg returned -11
> drbd0: drbd0_receiver [236]: cstate WFReportParams --> BrokenPipe
> drbd0: short read expecting header on sock: r=-11
> drbd0: Discarding network configuration.
>
> the slave's DRBD comes back up okay, but without the master and being
> desynced, it's obviously useless.
>
> I assume my only option is to reboot the master to get out of this mess
> - if there is any way to stop DRBD from sometimes doing this on network
> failure it would be very much appreciated ;)

hm.
could you kick the kernel log daemon (klogd -i), and then trigger
a sysrq Task dump (echo t > /proc/sysrq-trigger) ?

or at least give the output of /proc/drbd, and
ps -eo pid,comm,stat,wchan ?

	Lars Ellenberg
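
When comparing the logs from both nodes, it can help to pull out only the
cstate transitions and see where each node ended up. The following is a
minimal Python sketch, assuming the kernel log line format shown in the quoted
messages above. The names CSTATE_RE, cstate_transitions and last_cstate are
made up for illustration and are not part of DRBD or its tools.

```python
import re

# Matches kernel log lines of the form quoted in this thread, e.g.:
#   drbd0: drbd0_receiver [328]: cstate WFConnection --> WFReportParams
# (Assumption: the format is exactly as it appears in the logs above.)
CSTATE_RE = re.compile(
    r"(?P<dev>drbd\d+): (?P<thread>\S+) \[(?P<pid>\d+)\]: "
    r"cstate (?P<old>\w+) --> (?P<new>\w+)"
)


def cstate_transitions(log_lines):
    """Return (device, thread, old_state, new_state) for each cstate line."""
    result = []
    for line in log_lines:
        m = CSTATE_RE.search(line)
        if m:
            result.append(
                (m.group("dev"), m.group("thread"), m.group("old"), m.group("new"))
            )
    return result


def last_cstate(log_lines):
    """Return the last connection state logged for each device."""
    last = {}
    for dev, _thread, _old, new in cstate_transitions(log_lines):
        last[dev] = new
    return last


# Excerpt of the master's log as quoted in the message above.
MASTER_LOG = """\
drbd0: PingAck did not arrive in time.
drbd0: drbd0_asender [329]: cstate Connected --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [328]: cstate NetworkFailure --> BrokenPipe
drbd0: short read expecting header on sock: r=-512
drbd0: worker terminated
drbd0: drbd0_receiver [328]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [328]: cstate Unconnected --> WFConnection
drbd0: drbd0_receiver [328]: cstate WFConnection --> WFReportParams
""".splitlines()

if __name__ == "__main__":
    # Shows the state the master was left in: WFReportParams.
    print(last_cstate(MASTER_LOG))
```

Run against the slave's log instead, the same function would show that the
slave left WFReportParams again (for BrokenPipe) while the master did not,
which is the asymmetry described in the report.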