Lars Ellenberg wrote:
> / 2004-10-21 20:03:58 +0100
> \ Matthew Hodgson:
>
>>Most recently however, this happened with the master reporting:
>>
>>NETDEV WATCHDOG: eth1: transmit timed out
>>drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
>>e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
>>drbd0: PingAck did not arrive in time.
>>drbd0: drbd0_asender [329]: cstate Connected --> NetworkFailure
>>drbd0: asender terminated
>>drbd0: drbd0_receiver [328]: cstate NetworkFailure --> BrokenPipe
>>drbd0: short read expecting header on sock: r=-512
>>drbd0: short sent UnplugRemote size=8 sent=-1001
>>drbd0: worker terminated
>>drbd0: drbd0_receiver [328]: cstate BrokenPipe --> Unconnected
>>drbd0: Connection lost.
>>drbd0: drbd0_receiver [328]: cstate Unconnected --> WFConnection
>>drbd0: drbd0_receiver [328]: cstate WFConnection --> WFReportParams
>>
>>DRBD then hangs hard in WFReportParams mode. The underlying device is
>>still accessible, but the drbdsetup userland utils hang solid, and of
>>course replication is dead.
>>
>>Meanwhile on the slave:
>>
>>e1000: eth1: e1000_watchdog: NIC Link is Down
>>e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
>>drbd0: meta connection shut down by peer.
>>drbd0: drbd0_asender [243]: cstate Connected --> NetworkFailure
>>drbd0: asender terminated
>>drbd0: drbd0_receiver [236]: cstate NetworkFailure --> BrokenPipe
>>drbd0: short read receiving data block: read 924 expected 4096
>>drbd0: error receiving Data, l: 4112!
>>drbd0: worker terminated
>>drbd0: drbd0_receiver [236]: cstate BrokenPipe --> Unconnected
>>drbd0: Connection lost.
>>drbd0: drbd0_receiver [236]: cstate Unconnected --> WFConnection
>>drbd0: drbd0_receiver [236]: cstate WFConnection --> WFReportParams
>>drbd0: sock_recvmsg returned -11
>>drbd0: drbd0_receiver [236]: cstate WFReportParams --> BrokenPipe
>>drbd0: short read expecting header on sock: r=-11
>>drbd0: Discarding network configuration.
>>
>>the slave's DRBD comes back up okay, but without the master and being
>>desynced, it's obviously useless.
>>
>>I assume my only option is to reboot the master to get out of this mess
>>- if there is any way to stop DRBD from sometimes doing this on network
>>failure it would be very much appreciated ;)
>
>
>
> hm.
> could you kick the kernel log daemon (klogd -i), and
> then trigger a sysrq Task dump (echo t > /proc/sysrq-trigger) ?
Apologies for the width and length of the information here. I haven't truncated
it, as I'm not sure what might be obliquely relevant and what isn't:
# klogd -i
# echo t > /proc/sysrq-trigger
# cat /var/log/kern.log
Oct 22 12:49:46 kernel: SysRq : Show State
Oct 22 12:49:46 kernel:
Oct 22 12:49:46 kernel: free sibling
Oct 22 12:49:46 kernel: task PC stack pid father child younger older
Oct 22 12:49:46 kernel: init S C036B8A0 3868 1 0 6151 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013701b>] [<c0116c08>] [<c013724c>] [<c0116b50>] [<c0149da7>]
Oct 22 12:49:46 kernel: [<c0150729>] [<c0150bc9>] [<c0147b09>] [<c010723b>]
Oct 22 12:49:46 kernel: keventd S EC35BED4 6032 2 1 3 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0128711>] [<c0128600>] [<c0105000>] [<c010578e>] [<c0128600>]
Oct 22 12:49:46 kernel: ksoftirqd_CPU S 00000000 4528 3 1 4 2 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c011f93c>] [<c0105000>] [<c010578e>] [<c011f850>]
Oct 22 12:49:46 kernel: kswapd S C2874000 4928 4 1 5 3 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0136537>] [<c0105000>] [<c010578e>] [<c0136480>]
Oct 22 12:49:46 kernel: bdflush S CA238B80 4132 5 1 6 4 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c01175fe>] [<c0144347>] [<c0105000>] [<c010578e>] [<c0144260>]
Oct 22 12:49:46 kernel: kupdated S F7A5D900 3164 6 1 7 5 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c014443d>] [<c01071ee>] [<c0144360>]
Oct 22 12:49:46 kernel: [<c0105000>] [<c010578e>] [<c0144360>]
Oct 22 12:49:46 kernel: xfsbufd S F7AF3880 4272 7 1 8 6 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c011f83a>] [<c02074e7>] [<c0105000>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0207440>]
Oct 22 12:49:46 kernel: xfslogd/0 S C0495644 4764 8 1 9 7 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02072ea>] [<c0105000>] [<c02073d3>] [<c010578e>] [<c02073a0>]
Oct 22 12:49:46 kernel: xfsdatad/0 S C2842000 6188 9 1 10 8 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02072ea>] [<c0105000>] [<c0207413>] [<c010578e>] [<c02073e0>]
Oct 22 12:49:46 kernel: scsi_eh_0 S C0367EE0 6220 10 1 11 9 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0105de9>] [<c0105eb7>] [<c0281f9c>] [<c010578e>] [<c0281cd0>]
Oct 22 12:49:46 kernel: mdrecoveryd S C2834000 0 11 1 58 10 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02a9077>] [<c010578e>] [<c02a8ee0>]
Oct 22 12:49:46 kernel: syslogd S C010678F 1376 58 1 61 11 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c010678f>] [<c0116c55>] [<c02b122e>] [<c02ab1fc>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c011e82b>] [<c010723b>]
Oct 22 12:49:46 kernel: klogd R 00000000 4692 61 1 81 58 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c011a497>] [<c013ef6b>] [<c010723b>]
Oct 22 12:49:46 kernel: inetd S F70ABEE8 5064 81 1 84 61 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013724c>] [<c0116c55>] [<c02ab1fc>] [<c0150729>] [<c0150bc9>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: sshd S F7078700 4504 84 1 27733 95 81 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013724c>] [<c0116c55>] [<c02ab1fc>] [<c0150729>] [<c0150bc9>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: rpc.portmap S 00000246 1376 95 1 98 84 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c02ab1fc>] [<c0150dcb>] [<c0150eb4>] [<c01510d1>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: rpc.rquotad S 00000246 6104 98 1 101 95 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c02ab1fc>] [<c0150dcb>] [<c0150eb4>] [<c01510d1>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 100 1 111 103 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46 kernel: [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: lockd S C0307670 6132 101 1 102 109 98 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0307670>] [<c0116c55>] [<c0307670>] [<c030ba60>] [<c0193a3d>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c01938e0>]
Oct 22 12:49:46 kernel: rpciod S 00000020 6320 102 101 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c011d4fe>] [<c030785b>] [<c0307670>] [<c010578e>] [<c0307670>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 103 1 100 104 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 104 1 103 105 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 105 1 104 106 (L-TLB)
Oct 22 12:49:46 kernel: klogd 1.4.1, ---------- state change ----------
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46 kernel: [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 106 1 105 107 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 107 1 106 108 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46 kernel: [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3080 108 1 107 109 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46 kernel: [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: nfsd S 1844A653 3004 109 1 108 101 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46 kernel: rpc.mountd S C0280720 2404 111 1 114 100 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0280720>] [<c0116c55>] [<c02b122e>] [<c02ab1fc>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c0154751>] [<c010723b>]
Oct 22 12:49:46 kernel: rpc.statd S F7003ED0 5120 114 1 120 111 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02abfd2>] [<c0116c55>] [<c02b122e>] [<c02ab1fc>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c0154751>] [<c01401c4>] [<c010723b>]
Oct 22 12:49:46 kernel: crond S 00000001 5216 120 1 122 114 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c014a875>] [<c0116c08>] [<c0116b50>] [<c012388b>] [<c010723b>]
Oct 22 12:49:46 kernel: atd S 00000001 5108 122 1 128 120 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c08>] [<c0116b50>] [<c012388b>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 4820 128 1 129 122 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 4436 129 1 130 128 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 5064 130 1 131 129 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 5064 131 1 132 130 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 5064 132 1 133 131 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: agetty S C02204F2 5064 133 1 134 132 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46 kernel: ntpd S 00000000 2404 134 1 135 280 133 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c02ab1fc>] [<c0150dcb>] [<c0150eb4>] [<c01510d1>]
Oct 22 12:49:46 kernel: [<c010723b>]
Oct 22 12:49:46 kernel: ntpd S 00000030 4956 135 134 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013724c>] [<c0116c08>] [<c02ab1fc>] [<c0116b50>] [<c0150eb4>]
Oct 22 12:49:46 kernel: [<c01510d1>] [<c010723b>]
Oct 22 12:49:46 kernel: xfssyncd S 00000002 140 280 1 328 134 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c01f342c>] [<c0116c08>] [<c0116b50>] [<c0204283>] [<c0203731>]
Oct 22 12:49:46 kernel: [<c010578e>] [<c0203690>]
Oct 22 12:49:46 kernel: drbd0_receive D 00000001 4416 328 1 4542 280 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0105d12>] [<c0105eac>] [<f896ad33>] [<f897a1f9>] [<f896fe7b>]
Oct 22 12:49:46 kernel: [<f8969bad>] [<f897a1f9>] [<f89663ab>] [<f8969de8>] [<f897a7f0>] [<f896ff5a>]
Oct 22 12:49:46 kernel: [<c010578e>] [<f896fee0>]
Oct 22 12:49:46 kernel: drbd0_worker S 00000002 4572 4542 1 6151 328 (L-TLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0311ca6>] [<c0105de9>] [<c0105eb7>] [<f8965174>] [<f897a46d>]
Oct 22 12:49:46 kernel: [<f896ff5a>] [<c010578e>] [<f896fee0>]
Oct 22 12:49:46 kernel: drbdsetup D 4000A490 0 6151 1 4542 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0105d12>] [<c0105eac>] [<f8973d4f>] [<f89706e2>] [<f896140c>]
Oct 22 12:49:46 kernel: [<f8961e9e>] [<c0146e85>] [<c014f775>] [<c010723b>]
Oct 22 12:49:46 kernel: sshd R 00000002 0 22511 84 22513 27498 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013efd7>] [<c01072e9>]
Oct 22 12:49:46 kernel: bash R current 0 22513 22511 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c02219ab>] [<c0220c1d>] [<c0311ca6>] [<c011a865>] [<c011aadf>]
Oct 22 12:49:46 kernel: [<c011aadf>] [<c01074f9>] [<c01074f9>] [<c01180ed>] [<c022a2cb>] [<c022a229>]
Oct 22 12:49:46 kernel: [<c0166b0e>] [<c013f0db>] [<c010723b>]
Oct 22 12:49:46 kernel: sshd S C036B8A0 0 27498 84 27500 27733 22511 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013701b>] [<c0116c55>] [<c0216c2b>] [<c02152f6>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c010723b>]
Oct 22 12:49:46 kernel: bash S 00000000 1300 27500 27498 27732 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c011e0e8>] [<c021174a>] [<c010723b>]
Oct 22 12:49:46 kernel: pico S C920FEB0 0 27732 27500 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c0216c69>] [<c0216c2b>] [<c02152f6>] [<c0214e42>]
Oct 22 12:49:46 kernel: [<c0150dcb>] [<c020fce9>] [<c013ef6b>] [<c010723b>]
Oct 22 12:49:46 kernel: sshd S C036B8A0 4 27733 84 27735 27498 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c013701b>] [<c0116c55>] [<c0216c2b>] [<c02152f6>] [<c0150729>]
Oct 22 12:49:46 kernel: [<c0150bc9>] [<c010723b>]
Oct 22 12:49:46 kernel: bash S E1AEDEB0 0 27735 27733 (NOTLB)
Oct 22 12:49:46 kernel: Call Trace: [<c0116c55>] [<c0115f18>] [<c0214e42>] [<c020fce9>] [<c013ef6b>]
Oct 22 12:49:46 kernel: [<c010723b>]
> or at least give the output of /proc/drbd, and
> ps -eo pid,comm,stat,wchan ?
# cat /proc/drbd
version: 0.7.5 (api:76/proto:74)
SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22
0: cs:WFReportParams st:Primary/Unknown ld:Consistent
ns:8715928 nr:0 dw:36426099 dr:7245115 al:8923 bm:2370 lo:0 pe:0 ua:0 ap:0
1: cs:Unconfigured
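For anyone who wants to watch for this state from a script, here is a minimal Python sketch that pulls the per-device key:value fields (cs, st, ld) out of text in the /proc/drbd layout shown above. It is not part of DRBD; the function name parse_proc_drbd and the SAMPLE string are made up for illustration, and it assumes the 0.7.x layout where each device line starts with the minor number followed by a colon.

```python
import re

# A device line looks like "0: cs:WFReportParams st:Primary/Unknown ld:Consistent".
# Assumption: this matches the 0.7.x /proc/drbd layout quoted in this mail.
DEVICE_LINE = re.compile(r"^\s*(\d+):\s+(.*)$")


def parse_proc_drbd(text):
    """Return {minor: {key: value}} for every device line in the text."""
    devices = {}
    for line in text.splitlines():
        match = DEVICE_LINE.match(line)
        if not match:
            continue  # version banner, counters line, blank lines
        minor = int(match.group(1))
        fields = {}
        for token in match.group(2).split():
            if ":" in token:
                key, value = token.split(":", 1)
                fields[key] = value
        devices[minor] = fields
    return devices


# Sample text copied from the output above (hypothetical variable name).
SAMPLE = """\
version: 0.7.5 (api:76/proto:74)
SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22
0: cs:WFReportParams st:Primary/Unknown ld:Consistent
ns:8715928 nr:0 dw:36426099 dr:7245115 al:8923 bm:2370 lo:0 pe:0 ua:0 ap:0
1: cs:Unconfigured
"""

for minor, fields in parse_proc_drbd(SAMPLE).items():
    print(minor, fields.get("cs"))
```

On a live system the same function could be fed the contents of /proc/drbd instead of SAMPLE; a device stuck in cs:WFReportParams for more than a few seconds would be the condition to alert on.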
# ps -eo pid,comm,stat,wchan
PID COMMAND STAT WCHAN
1 init S select
2 keventd S context_thread
3 ksoftirqd_CPU0 SN ksoftirqd
4 kswapd S kswapd
5 bdflush S bdflush
6 kupdated S kupdate
7 xfsbufd S pagebuf_daemon
8 xfslogd/0 S pagebuf_iodone_daemon
9 xfsdatad/0 S pagebuf_iodone_daemon
10 scsi_eh_0 S down_interruptible
11 mdrecoveryd S< skb_copy_datagram_iovec
58 syslogd Ss select
61 klogd Ss syslog
81 inetd Ss select
84 sshd Ss select
95 rpc.portmap Ss poll
98 rpc.rquotad Ss poll
100 nfsd S bitreverse
101 lockd S bitreverse
102 rpciod S vlan_proc_read
103 nfsd S bitreverse
104 nfsd S bitreverse
105 nfsd S bitreverse
106 nfsd S bitreverse
107 nfsd S bitreverse
108 nfsd S bitreverse
109 nfsd S bitreverse
111 rpc.mountd Ss select
114 rpc.statd Ss select
120 crond S nanosleep
122 atd Ss nanosleep
128 agetty Ss+ read_chan
129 agetty Ss+ read_chan
130 agetty Ss+ read_chan
131 agetty Ss+ read_chan
132 agetty Ss+ read_chan
133 agetty Ss+ read_chan
134 ntpd Ss poll
135 ntpd S poll
280 xfssyncd S xfssyncd
328 drbd0_receiver D down
4542 drbd0_worker S down_interruptible
6151 drbdsetup D down
22511 sshd Ss select
22513 bash Ss wait4
28315 ps R+ -
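The interesting entries in that list are the ones in state D (uninterruptible sleep). As a rough sketch, assuming the four-column output of ps -eo pid,comm,stat,wchan and command names without spaces, they can be filtered out like this; blocked_tasks and SAMPLE are hypothetical names used only for this example.

```python
def blocked_tasks(ps_output):
    """Return (pid, command, wchan) for each task whose STAT starts with 'D'."""
    blocked = []
    # Skip the "PID COMMAND STAT WCHAN" header line.
    for line in ps_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed or truncated line
        pid, comm, stat, wchan = fields[0], fields[1], fields[2], fields[3]
        if stat.startswith("D"):
            blocked.append((int(pid), comm, wchan))
    return blocked


# A few lines copied from the ps output above (hypothetical variable name).
SAMPLE = """\
PID COMMAND STAT WCHAN
280 xfssyncd S xfssyncd
328 drbd0_receiver D down
4542 drbd0_worker S down_interruptible
6151 drbdsetup D down
28315 ps R+ -
"""

for pid, comm, wchan in blocked_tasks(SAMPLE):
    print(pid, comm, wchan)
```

For the listing above this picks out drbd0_receiver and drbdsetup, both waiting in down.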
many thanks for looking into this :)
M.
--
______________________________________________________________
Matthew Hodgson matthew at mxtelecom.com Tel: +44 845 6667778
Systems Analyst, MX Telecom Ltd.