[DRBD-user] 0.7.4 in state WFReportParams forever ?

Thu Oct 21 21:43:23 CEST 2004

/ 2004-10-21 20:03:58 +0100
\ Matthew Hodgson:
> Lars Ellenberg wrote:
> 
> >>>/ 2004-10-07 13:46:15 +0200
> >>>\ Alex Ongena:
> >>>
> >>>>Hi,
> >>>>
> >>>>My master stays in WFReportParams forever due to a network
> >>>>failure on my slave.
> >>>>
> >>>>Scenario: Master is running and is Primary, Slave is booting
> >>>>
> >>>>This is the relevant log:
> >>>
> >>>interesting.
> >>>I miss the "handshake successful" message, though.
> >>>anyways, this "should not happen".
> >>>
> >>>we'll have a look.
> >>>
> >>>what kernel is this, in case it matters?
> >>
> >>There is a new release 0.7.5 ... perhaps it fixes this?
> >
> >unlikely.
> >those were unrelated fixes, I think.
> 
> I currently have a fileserver stuck with the same problem (I think) 
> running 0.7.5 on 2.4.27.  The cluster is a pair of identical:
> 
> # uname -a
> Linux 2.4.27 #14 SMP Tue Oct 12 16:31:10 BST 2004 i686 unknown unknown 
> GNU/Linux
> 
> vendor_id       : GenuineIntel
> cpu family      : 15
> model           : 2
> model name      : Intel(R) Xeon(TM) CPU 2.80GHz
> stepping        : 5
> cpu MHz         : 2793.076
> cache size      : 512 KB
> 
> MemTotal:      2068944 kB
> 
> Intel(R) PRO/1000 Network Driver - version 5.4.11
> e1000: eth0, eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
> (using latest e1000-5.4.11 from Intel)
> 
> Intel(R) PRO/100 Network Driver - version 2.3.43-k1
> e100: eth2: Intel(R) PRO/100 Network Connection
> 
> ICH5: chipset revision 2
> 
> 3ware 9500s-8 SCSI-style IDE RAID Controller:
> 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfe8ffc00, 
> 3w-9xxx: scsi0: Firmware FE9X 2.02.00.012, BIOS BE9X 2.02.01.037, Ports: 8.
> 
> 7x250G + 1 hot spare disks, RAID 5, so ~1.4T logical disk space per node.
> 
> drbd: initialised. Version: 0.7.5 (api:76/proto:74)
> drbd: SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22
> 
> eth0 and eth2 are bonded as bond0 and access the main LAN - eth1 however 
> is dedicated to drbd as a gigabit crossover segment direct to the other 
> node, on a 10.0.0.0/24 network.
> 
> The nodes have been running using protocol C brought up by:
> 
> drbdsetup /dev/drbd0 disk /dev/sda3 internal -1
> drbdsetup /dev/drbd0 primary
> drbdsetup /dev/drbd0 net 10.0.0.2:7788 10.0.0.1:7788 C
> drbdsetup /dev/drbd0 syncer -r 512000
> 
> The shared device is a single 1.4T XFS partition.
> 
> 
> The e1000 driver has been very flakey, with:
> 
> NETDEV WATCHDOG: eth1: transmit timed out
> drbd0: PingAck did not arrive in time.
> drbd0: drbd0_asender [164]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [163]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read expecting header on sock: r=-512
> drbd0: worker terminated
> drbd0: drbd0_receiver [163]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> 
> appearing quite often on the master, associated with:
> 
> e1000: eth1: e1000_watchdog: NIC Link is Down
> e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
> drbd0: meta connection shut down by peer.
> drbd0: drbd0_asender [181]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [180]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read receiving data block: read 1632 expected 4096
> drbd0: error receiving Data, l: 4112!
> drbd0: worker terminated
> drbd0: drbd0_receiver [180]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> 
> appearing on the slave.
> 
> 
> Most recently however, this happened with the master reporting:
> 
> NETDEV WATCHDOG: eth1: transmit timed out
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
> e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
> drbd0: PingAck did not arrive in time.
> drbd0: drbd0_asender [329]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [328]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read expecting header on sock: r=-512
> drbd0: short sent UnplugRemote size=8 sent=-1001
> drbd0: worker terminated
> drbd0: drbd0_receiver [328]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> drbd0: drbd0_receiver [328]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [328]: cstate WFConnection --> WFReportParams
> 
> DRBD then hangs hard in WFReportParams mode.  The underlying device is 
> still accessible, but the drbdsetup userland utils hang solid, and of 
> course replication is dead.
> 
> Meanwhile on the slave:
> 
> e1000: eth1: e1000_watchdog: NIC Link is Down
> e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
> drbd0: meta connection shut down by peer.
> drbd0: drbd0_asender [243]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [236]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read receiving data block: read 924 expected 4096
> drbd0: error receiving Data, l: 4112!
> drbd0: worker terminated
> drbd0: drbd0_receiver [236]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> drbd0: drbd0_receiver [236]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [236]: cstate WFConnection --> WFReportParams
> drbd0: sock_recvmsg returned -11
> drbd0: drbd0_receiver [236]: cstate WFReportParams --> BrokenPipe
> drbd0: short read expecting header on sock: r=-11
> drbd0: Discarding network configuration.
> 
> the slave's DRBD comes back up okay, but without the master and being 
> desynced, it's obviously useless.
> 
> I assume my only option is to reboot the master to get out of this mess 
> - if there is any way to stop DRBD from sometimes doing this on network 
> failure it would be very much appreciated ;)
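
To compare where each node ended up, the cstate transitions in the logs
above can be pulled out mechanically. This is only a rough sketch: the
function names cstate_transitions and last_cstate are made up for
illustration and are not part of DRBD, and it assumes the log lines keep
the "cstate A --> B" wording shown above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of DRBD.

# Print each cstate transition found in kernel log lines on stdin,
# one "FROM TO" pair per line. Lines without a transition are skipped.
cstate_transitions() {
    sed -n 's/.*cstate \([A-Za-z]*\) --> \([A-Za-z]*\).*/\1 \2/p'
}

# Print only the last state reached, i.e. where the node ended up.
last_cstate() {
    cstate_transitions | tail -n 1 | cut -d' ' -f2
}
```

Feeding it the drbd0 lines from the kernel log (for example from dmesg or
the syslog file, whichever is available) on each node shows the final
state per node.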

hm.
could you kick the kernel log daemon (klogd -i), and
then trigger a sysrq Task dump (echo t > /proc/sysrq-trigger) ?

or at least give the output of /proc/drbd, and
ps -eo pid,comm,stat,wchan ?
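
To narrow down the ps output, something like the following may help. It
is a sketch under assumptions: blocked_tasks is a made-up name, it relies
on the column order pid,comm,stat,wchan used above and on command names
without spaces, and it assumes the hung tasks sit in uninterruptible
sleep (STAT starting with "D"), which may not be the case here.

```shell
#!/bin/sh
# Hypothetical filter, not a DRBD tool.
# Reads "ps -eo pid,comm,stat,wchan" output on stdin and keeps the header
# line plus every task whose STAT column (third field) starts with "D".
blocked_tasks() {
    awk 'NR == 1 || $3 ~ /^D/'
}
```

Usage would be: ps -eo pid,comm,stat,wchan | blocked_tasks
The WCHAN column of the remaining lines then shows where those tasks are
waiting in the kernel.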

	Lars Ellenberg