Lars Ellenberg wrote:
> / 2004-10-21 20:03:58 +0100
> \ Matthew Hodgson:
>
>>Most recently however, this happened with the master reporting:
>>
>>NETDEV WATCHDOG: eth1: transmit timed out
>>drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
>>e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
>>drbd0: PingAck did not arrive in time.
>>drbd0: drbd0_asender [329]: cstate Connected --> NetworkFailure
>>drbd0: asender terminated
>>drbd0: drbd0_receiver [328]: cstate NetworkFailure --> BrokenPipe
>>drbd0: short read expecting header on sock: r=-512
>>drbd0: short sent UnplugRemote size=8 sent=-1001
>>drbd0: worker terminated
>>drbd0: drbd0_receiver [328]: cstate BrokenPipe --> Unconnected
>>drbd0: Connection lost.
>>drbd0: drbd0_receiver [328]: cstate Unconnected --> WFConnection
>>drbd0: drbd0_receiver [328]: cstate WFConnection --> WFReportParams
>>
>>DRBD then hangs hard in WFReportParams mode. The underlying device is
>>still accessible, but the drbdsetup userland utils hang solid, and of
>>course replication is dead.
>>
>>Meanwhile on the slave:
>>
>>e1000: eth1: e1000_watchdog: NIC Link is Down
>>e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
>>drbd0: meta connection shut down by peer.
>>drbd0: drbd0_asender [243]: cstate Connected --> NetworkFailure
>>drbd0: asender terminated
>>drbd0: drbd0_receiver [236]: cstate NetworkFailure --> BrokenPipe
>>drbd0: short read receiving data block: read 924 expected 4096
>>drbd0: error receiving Data, l: 4112!
>>drbd0: worker terminated
>>drbd0: drbd0_receiver [236]: cstate BrokenPipe --> Unconnected
>>drbd0: Connection lost.
>>drbd0: drbd0_receiver [236]: cstate Unconnected --> WFConnection
>>drbd0: drbd0_receiver [236]: cstate WFConnection --> WFReportParams
>>drbd0: sock_recvmsg returned -11
>>drbd0: drbd0_receiver [236]: cstate WFReportParams --> BrokenPipe
>>drbd0: short read expecting header on sock: r=-11
>>drbd0: Discarding network configuration.
>>
>>the slave's DRBD comes back up okay, but without the master and being
>>desynced, it's obviously useless.
>>
>>I assume my only option is to reboot the master to get out of this mess
>>- if there is any way to stop DRBD from sometimes doing this on network
>>failure it would be very much appreciated ;)
>
>
>
> hm.
> could you kick the kernel log daemon (klogd -i), and
> then trigger a sysrq Task dump (echo t > /proc/sysrq-trigger) ?

Apologies for the width & length of information here - I haven't
truncated it as I'm not sure what might be obliquely relevant and what
isn't:

# klogd -i
# echo t > /proc/sysrq-trigger
# cat /var/log/kern.log
Oct 22 12:49:46 kernel: SysRq : Show State
Oct 22 12:49:46 kernel:
Oct 22 12:49:46 kernel: free sibling
Oct 22 12:49:46 kernel: task PC stack pid father child younger older
Oct 22 12:49:46 kernel: init S C036B8A0 3868 1 0 6151 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013701b>] [<c0116c08>] [<c013724c>] [<c0116b50>] [<c0149da7>]
Oct 22 12:49:46 kernel: [<c0150729>] [<c0150bc9>] [<c0147b09>] [<c010723b>]
Oct 22 12:49:46 kernel: keventd S EC35BED4 6032 2 1 3 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0128711>] [<c0128600>] [<c0105000>] [<c010578e>] [<c0128600>]
Oct 22 12:49:46 kernel: ksoftirqd_CPU S 00000000 4528 3 1 4 2 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c011f93c>] [<c0105000>] [<c010578e>] [<c011f850>]
Oct 22 12:49:46 kernel: kswapd S C2874000 4928 4 1 5 3 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0136537>] [<c0105000>] [<c010578e>] [<c0136480>]
Oct 22 12:49:46 kernel: bdflush S CA238B80 4132 5 1 6 4 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c01175fe>] [<c0144347>] [<c0105000>] [<c010578e>] [<c0144260>]
Oct 22 12:49:46 kernel: kupdated S F7A5D900 3164 6 1 7 5 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c014443d>] [<c01071ee>] [<c0144360>]
Oct 22 12:49:46 kernel: [<c0105000>] [<c010578e>] [<c0144360>]
Oct 22 12:49:46 kernel: xfsbufd S F7AF3880 4272 7 1 8 6 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c011f83a>] [<c02074e7>] [<c0105000>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0207440>]
Oct 22 12:49:46 kernel: xfslogd/0 S C0495644 4764 8 1 9 7 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02072ea>] [<c0105000>] [<c02073d3>] [<c010578e>] [<c02073a0>]
Oct 22 12:49:46 kernel: xfsdatad/0 S C2842000 6188 9 1 10 8 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02072ea>] [<c0105000>] [<c0207413>] [<c010578e>] [<c02073e0>]
Oct 22 12:49:46 kernel: scsi_eh_0 S C0367EE0 6220 10 1 11 9 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0105de9>] [<c0105eb7>] [<c0281f9c>] [<c010578e>] [<c0281cd0>]
Oct 22 12:49:46 kernel: mdrecoveryd S C2834000 0 11 1 58 10 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02a9077>] [<c010578e>] [<c02a8ee0>]
Oct 22 12:49:46 kernel: syslogd S C010678F 1376 58 1 61 11 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c010678f>] [<c0116c55>] [<c02b122e>] [<c02ab1fc>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c011e82b>] [<c010723b>]
Oct 22 12:49:46 kernel: klogd R 00000000 4692 61 1 81 58 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c011a497>] [<c013ef6b>] [<c010723b>]
Oct 22 12:49:46 kernel: inetd S F70ABEE8 5064 81 1 84 61 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013724c>] [<c0116c55>] [<c02ab1fc>] [<c0150729>] [<c0150bc9>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: sshd S F7078700 4504 84 1 27733 95 81 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013724c>] [<c0116c55>] [<c02ab1fc>] [<c0150729>] [<c0150bc9>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: rpc.portmap S 00000246 1376 95 1 98 84 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c02ab1fc>] [<c0150dcb>] [<c0150eb4>] [<c01510d1>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: rpc.rquotad S 00000246 6104 98 1 101 95 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c02ab1fc>] [<c0150dcb>] [<c0150eb4>] [<c01510d1>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 100 1 111 103 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46 kernel: [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: lockd S C0307670 6132 101 1 102 109 98 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0307670>] [<c0116c55>] [<c0307670>] [<c030ba60>] [<c0193a3d>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c01938e0>]
Oct 22 12:49:46 kernel: rpciod S 00000020 6320 102 101 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c011d4fe>] [<c030785b>] [<c0307670>] [<c010578e>] [<c0307670>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 103 1 100 104 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 104 1 103 105 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 105 1 104 106 (L-TLB)
Oct 22 12:49:46 kernel: klogd 1.4.1, ---------- state change ----------
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46 kernel: [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 106 1 105 107 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 107 1 106 108 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46 kernel: [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3080 108 1 107 109 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46 kernel: [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 109 1 108 101 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: rpc.mountd S C0280720 2404 111 1 114 100 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0280720>] [<c0116c55>] [<c02b122e>] [<c02ab1fc>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c0154751>] [<c010723b>]
Oct 22 12:49:46 kernel: rpc.statd S F7003ED0 5120 114 1 120 111 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02abfd2>] [<c0116c55>] [<c02b122e>] [<c02ab1fc>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c0154751>] [<c01401c4>] [<c010723b>]
Oct 22 12:49:46 kernel: crond S 00000001 5216 120 1 122 114 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c014a875>] [<c0116c08>] [<c0116b50>] [<c012388b>] [<c010723b>]
Oct 22 12:49:46 kernel: atd S 00000001 5108 122 1 128 120 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c012388b>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 4820 128 1 129 122 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 4436 129 1 130 128 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 5064 130 1 131 129 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 5064 131 1 132 130 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 5064 132 1 133 131 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 5064 133 1 134 132 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: ntpd S 00000000 2404 134 1 135 280 133 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c02ab1fc>] [<c0150dcb>] [<c0150eb4>] [<c01510d1>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: ntpd S 00000030 4956 135 134 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013724c>] [<c0116c08>] [<c02ab1fc>] [<c0116b50>] [<c0150eb4>]
Oct 22 12:49:46 kernel: [<c01510d1>] [<c010723b>]
Oct 22 12:49:46 kernel: xfssyncd S 00000002 140 280 1 328 134 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c01f342c>] [<c0116c08>] [<c0116b50>] [<c0204283>] [<c0203731>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0203690>]
Oct 22 12:49:46 kernel: drbd0_receive D 00000001 4416 328 1 4542 280 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0105d12>] [<c0105eac>] [<f896ad33>] [<f897a1f9>] [<f896fe7b>]
Oct 22 12:49:46 kernel: [<f8969bad>] [<f897a1f9>] [<f89663ab>] [<f8969de8>] [<f897a7f0>] [<f896ff5a>]
Oct 22 12:49:46 kernel: [<c010578e>] [<f896fee0>]
Oct 22 12:49:46 kernel: drbd0_worker S 00000002 4572 4542 1 6151 328 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0311ca6>] [<c0105de9>] [<c0105eb7>] [<f8965174>] [<f897a46d>]
Oct 22 12:49:46 kernel: [<f896ff5a>] [<c010578e>] [<f896fee0>]
Oct 22 12:49:46 kernel: drbdsetup D 4000A490 0 6151 1 4542 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0105d12>] [<c0105eac>] [<f8973d4f>] [<f89706e2>] [<f896140c>]
Oct 22 12:49:46 kernel: [<f8961e9e>] [<c0146e85>] [<c014f775>] [<c010723b>]
Oct 22 12:49:46 kernel: sshd R 00000002 0 22511 84 22513 27498 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013efd7>] [<c01072e9>]
Oct 22 12:49:46 kernel: bash R current 0 22513 22511 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02219ab>] [<c0220c1d>] [<c0311ca6>] [<c011a865>] [<c011aadf>]
Oct 22 12:49:46 kernel: [<c011aadf>] [<c01074f9>] [<c01074f9>] [<c01180ed>] [<c022a2cb>] [<c022a229>]
Oct 22 12:49:46 kernel: [<c0166b0e>] [<c013f0db>] [<c010723b>]
Oct 22 12:49:46 kernel: sshd S C036B8A0 0 27498 84 27500 27733 22511 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013701b>] [<c0116c55>] [<c0216c2b>] [<c02152f6>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c010723b>]
Oct 22 12:49:46 kernel: bash S 00000000 1300 27500 27498 27732 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c011e0e8>] [<c021174a>] [<c010723b>]
Oct 22 12:49:46 kernel: pico S C920FEB0 0 27732 27500 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c0216c69>] [<c0216c2b>] [<c02152f6>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c0150dcb>] [<c020fce9>] [<c013ef6b>] [<c010723b>]
Oct 22 12:49:46 kernel: sshd S C036B8A0 4 27733 84 27735 27498 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013701b>] [<c0116c55>] [<c0216c2b>] [<c02152f6>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c010723b>]
Oct 22 12:49:46 kernel: bash S E1AEDEB0 0 27735 27733 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c0115f18>] [<c0214e42>] [<c020fce9>] [<c013ef6b>]
Oct 22 12:49:46 kernel: [<c010723b>]

> or at least give the output of /proc/drbd, and
> ps -eo pid,comm,stat,wchan ?

# cat /proc/drbd
version: 0.7.5 (api:76/proto:74)
SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22
 0: cs:WFReportParams st:Primary/Unknown ld:Consistent
    ns:8715928 nr:0 dw:36426099 dr:7245115 al:8923 bm:2370 lo:0 pe:0 ua:0 ap:0
 1: cs:Unconfigured

# ps -eo pid,comm,stat,wchan
  PID COMMAND          STAT  WCHAN
    1 init             S     select
    2 keventd          S     context_thread
    3 ksoftirqd_CPU0   SN    ksoftirqd
    4 kswapd           S     kswapd
    5 bdflush          S     bdflush
    6 kupdated         S     kupdate
    7 xfsbufd          S     pagebuf_daemon
    8 xfslogd/0        S     pagebuf_iodone_daemon
    9 xfsdatad/0       S     pagebuf_iodone_daemon
   10 scsi_eh_0        S     down_interruptible
   11 mdrecoveryd      S<    skb_copy_datagram_iovec
   58 syslogd          Ss    select
   61 klogd            Ss    syslog
   81 inetd            Ss    select
   84 sshd             Ss    select
   95 rpc.portmap      Ss    poll
   98 rpc.rquotad      Ss    poll
  100 nfsd             S     bitreverse
  101 lockd            S     bitreverse
  102 rpciod           S     vlan_proc_read
  103 nfsd             S     bitreverse
  104 nfsd             S     bitreverse
  105 nfsd             S     bitreverse
  106 nfsd             S     bitreverse
  107 nfsd             S     bitreverse
  108 nfsd             S     bitreverse
  109 nfsd             S     bitreverse
  111 rpc.mountd       Ss    select
  114 rpc.statd        Ss    select
  120 crond            S     nanosleep
  122 atd              Ss    nanosleep
  128 agetty           Ss+   read_chan
  129 agetty           Ss+   read_chan
  130 agetty           Ss+   read_chan
  131 agetty           Ss+   read_chan
  132 agetty           Ss+   read_chan
  133 agetty           Ss+   read_chan
  134 ntpd             Ss    poll
  135 ntpd             S     poll
  280 xfssyncd         S     xfssyncd
  328 drbd0_receiver   D     down
 4542 drbd0_worker     S     down_interruptible
 6151 drbdsetup        D     down
22511 sshd             Ss    select
22513 bash             Ss    wait4
28315 ps               R+    -

many thanks for looking into this :)

M.

--
______________________________________________________________
Matthew Hodgson   matthew at mxtelecom.com   Tel: +44 845 6667778
Systems Analyst, MX Telecom Ltd.
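
For anyone repeating this diagnosis: in a process list this long, the interesting entries are the tasks in state D (uninterruptible sleep), here drbd0_receiver and drbdsetup, both waiting in down. A minimal sketch of pulling those rows out of the same ps invocation follows. The helper name blocked_tasks is made up for illustration, and the filter assumes command names contain no spaces, which holds for the listing above but not on every system.

```shell
# blocked_tasks is a hypothetical helper, not part of DRBD or procps.
# It reads "ps -eo pid,comm,stat,wchan" style output on stdin and keeps
# the header line plus every row whose STAT column starts with "D"
# (uninterruptible sleep).
# Assumption: COMMAND contains no spaces, so STAT is always field 3.
blocked_tasks() {
    awk 'NR == 1 || $3 ~ /^D/'
}

# Same ps invocation as requested in the thread, filtered to blocked tasks.
ps -eo pid,comm,stat,wchan | blocked_tasks
```

Fed the listing above, this would keep only the header and the rows for PIDs 328 and 6151, which are the two tasks whose call traces in the SysRq dump point into the drbd module address range.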