[DRBD-user] 0.7.4 in state WFReportParams forever ?

Matthew Hodgson matthew at mxtelecom.com
Fri Oct 22 13:55:46 CEST 2004


Lars Ellenberg wrote:

> / 2004-10-21 20:03:58 +0100
> \ Matthew Hodgson:
> 
>>Most recently however, this happened with the master reporting:
>>
>>NETDEV WATCHDOG: eth1: transmit timed out
>>drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
>>e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
>>drbd0: PingAck did not arrive in time.
>>drbd0: drbd0_asender [329]: cstate Connected --> NetworkFailure
>>drbd0: asender terminated
>>drbd0: drbd0_receiver [328]: cstate NetworkFailure --> BrokenPipe
>>drbd0: short read expecting header on sock: r=-512
>>drbd0: short sent UnplugRemote size=8 sent=-1001
>>drbd0: worker terminated
>>drbd0: drbd0_receiver [328]: cstate BrokenPipe --> Unconnected
>>drbd0: Connection lost.
>>drbd0: drbd0_receiver [328]: cstate Unconnected --> WFConnection
>>drbd0: drbd0_receiver [328]: cstate WFConnection --> WFReportParams
>>
>>DRBD then hangs hard in WFReportParams mode.  The underlying device is 
>>still accessible, but the drbdsetup userland utils hang solid, and of 
>>course replication is dead.
>>
>>Meanwhile on the slave:
>>
>>e1000: eth1: e1000_watchdog: NIC Link is Down
>>e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
>>drbd0: meta connection shut down by peer.
>>drbd0: drbd0_asender [243]: cstate Connected --> NetworkFailure
>>drbd0: asender terminated
>>drbd0: drbd0_receiver [236]: cstate NetworkFailure --> BrokenPipe
>>drbd0: short read receiving data block: read 924 expected 4096
>>drbd0: error receiving Data, l: 4112!
>>drbd0: worker terminated
>>drbd0: drbd0_receiver [236]: cstate BrokenPipe --> Unconnected
>>drbd0: Connection lost.
>>drbd0: drbd0_receiver [236]: cstate Unconnected --> WFConnection
>>drbd0: drbd0_receiver [236]: cstate WFConnection --> WFReportParams
>>drbd0: sock_recvmsg returned -11
>>drbd0: drbd0_receiver [236]: cstate WFReportParams --> BrokenPipe
>>drbd0: short read expecting header on sock: r=-11
>>drbd0: Discarding network configuration.
>>
>>the slave's DRBD comes back up okay, but without the master and being 
>>desynced, it's obviously useless.
>>
>>I assume my only option is to reboot the master to get out of this mess 
>>- if there is any way to stop DRBD from sometimes doing this on network 
>>failure it would be very much appreciated ;)
> 
> 
> 
> hm.
> could you kick the kernel log daemon (klogd -i), and
> then trigger a sysrq Task dump (echo t > /proc/sysrq-trigger) ?

Apologies for the width & length of information here - I haven't truncated
it as I'm not sure what might be obliquely relevant and what isn't:

# klogd -i
# echo t > /proc/sysrq-trigger
# cat /var/log/kern.log
Oct 22 12:49:46  kernel: SysRq : Show State
Oct 22 12:49:46  kernel:
Oct 22 12:49:46  kernel:                          free                        sibling
Oct 22 12:49:46  kernel:   task             PC    stack   pid father child younger older
Oct 22 12:49:46  kernel: init          S C036B8A0  3868     1      0  6151               (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c013701b>] [<c0116c08>] [<c013724c>] [<c0116b50>] [<c0149da7>]
Oct 22 12:49:46  kernel:   [<c0150729>] [<c0150bc9>] [<c0147b09>] [<c010723b>]
Oct 22 12:49:46  kernel: keventd       S EC35BED4  6032     2      1             3       (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0128711>] [<c0128600>] [<c0105000>] [<c010578e>] [<c0128600>]
Oct 22 12:49:46  kernel: ksoftirqd_CPU S 00000000  4528     3      1             4     2 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c011f93c>] [<c0105000>] [<c010578e>] [<c011f850>]
Oct 22 12:49:46  kernel: kswapd        S C2874000  4928     4      1             5     3 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0136537>] [<c0105000>] [<c010578e>] [<c0136480>]
Oct 22 12:49:46  kernel: bdflush       S CA238B80  4132     5      1             6     4 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c01175fe>] [<c0144347>] [<c0105000>] [<c010578e>] [<c0144260>]
Oct 22 12:49:46  kernel: kupdated      S F7A5D900  3164     6      1             7     5 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0116b50>] [<c014443d>] [<c01071ee>] [<c0144360>]
Oct 22 12:49:46  kernel:   [<c0105000>] [<c010578e>] [<c0144360>]
Oct 22 12:49:46  kernel: xfsbufd       S F7AF3880  4272     7      1             8     6 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0116b50>] [<c011f83a>] [<c02074e7>] [<c0105000>]
Oct 22 12:49:46  kernel:   [<c010578e>] [<c0207440>]
Oct 22 12:49:46  kernel: xfslogd/0     S C0495644  4764     8      1             9     7 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02072ea>] [<c0105000>] [<c02073d3>] [<c010578e>] [<c02073a0>]
Oct 22 12:49:46  kernel: xfsdatad/0    S C2842000  6188     9      1            10     8 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02072ea>] [<c0105000>] [<c0207413>] [<c010578e>] [<c02073e0>]
Oct 22 12:49:46  kernel: scsi_eh_0     S C0367EE0  6220    10      1            11     9 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0105de9>] [<c0105eb7>] [<c0281f9c>] [<c010578e>] [<c0281cd0>]
Oct 22 12:49:46  kernel: mdrecoveryd   S C2834000     0    11      1            58    10 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02a9077>] [<c010578e>] [<c02a8ee0>]
Oct 22 12:49:46  kernel: syslogd       S C010678F  1376    58      1            61    11 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c010678f>] [<c0116c55>] [<c02b122e>] [<c02ab1fc>] [<c0150729>]
Oct 22 12:49:46  kernel:   [<c0150bc9>] [<c011e82b>] [<c010723b>]
Oct 22 12:49:46  kernel: klogd         R 00000000  4692    61      1            81    58 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c011a497>] [<c013ef6b>] [<c010723b>]
Oct 22 12:49:46  kernel: inetd         S F70ABEE8  5064    81      1            84    61 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c013724c>] [<c0116c55>] [<c02ab1fc>] [<c0150729>] [<c0150bc9>]
Oct 22 12:49:46  kernel:   [<c010723b>]
Oct 22 12:49:46  kernel: sshd          S F7078700  4504    84      1 27733      95    81 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c013724c>] [<c0116c55>] [<c02ab1fc>] [<c0150729>] [<c0150bc9>]
Oct 22 12:49:46  kernel:   [<c010723b>]
Oct 22 12:49:46  kernel: rpc.portmap   S 00000246  1376    95      1            98    84 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c55>] [<c02ab1fc>] [<c0150dcb>] [<c0150eb4>] [<c01510d1>]
Oct 22 12:49:46  kernel:   [<c010723b>]
Oct 22 12:49:46  kernel: rpc.rquotad   S 00000246  6104    98      1           101    95 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c55>] [<c02ab1fc>] [<c0150dcb>] [<c0150eb4>] [<c01510d1>]
Oct 22 12:49:46  kernel:   [<c010723b>]
Oct 22 12:49:46  kernel: nfsd          S 1844A653  3004   100      1           111   103 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46  kernel:   [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46  kernel: lockd         S C0307670  6132   101      1   102     109    98 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0307670>] [<c0116c55>] [<c0307670>] [<c030ba60>] [<c0193a3d>]
Oct 22 12:49:46  kernel:   [<c010578e>] [<c01938e0>]
Oct 22 12:49:46  kernel: rpciod        S 00000020  6320   102    101                     (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c011d4fe>] [<c030785b>] [<c0307670>] [<c010578e>] [<c0307670>]
Oct 22 12:49:46  kernel: nfsd          S 1844A653  3004   103      1           100   104 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46  kernel:   [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46  kernel: nfsd          S 1844A653  3004   104      1           103   105 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46  kernel:   [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46  kernel: nfsd          S 1844A653  3004   105      1           104   106 (L-TLB)
Oct 22 12:49:46  kernel: klogd 1.4.1, ---------- state change ----------
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46  kernel:   [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46  kernel: nfsd          S 1844A653  3004   106      1           105   107 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46  kernel:   [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46  kernel: nfsd          S 1844A653  3004   107      1           106   108 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46  kernel:   [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46  kernel: nfsd          S 1844A653  3080   108      1           107   109 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0309eb6>] [<c0116b50>] [<c030bc26>] [<c030ba60>]
Oct 22 12:49:46  kernel:   [<c0182fe0>] [<c030985c>] [<c0182d99>] [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46  kernel: nfsd          S 1844A653  3004   109      1           108   101 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0116b50>] [<c030ba60>] [<c030985c>] [<c0182d99>]
Oct 22 12:49:46  kernel:   [<c010578e>] [<c0182ca0>]
Oct 22 12:49:46  kernel: rpc.mountd    S C0280720  2404   111      1           114   100 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0280720>] [<c0116c55>] [<c02b122e>] [<c02ab1fc>] [<c0150729>]
Oct 22 12:49:46  kernel:   [<c0150bc9>] [<c0154751>] [<c010723b>]
Oct 22 12:49:46  kernel: rpc.statd     S F7003ED0  5120   114      1           120   111 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02abfd2>] [<c0116c55>] [<c02b122e>] [<c02ab1fc>] [<c0150729>]
Oct 22 12:49:46  kernel:   [<c0150bc9>] [<c0154751>] [<c01401c4>] [<c010723b>]
Oct 22 12:49:46  kernel: crond         S 00000001  5216   120      1           122   114 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c014a875>] [<c0116c08>] [<c0116b50>] [<c012388b>] [<c010723b>]
Oct 22 12:49:46  kernel: atd           S 00000001  5108   122      1           128   120 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c08>] [<c0116b50>] [<c012388b>] [<c010723b>]
Oct 22 12:49:46  kernel: agetty        S C02204F2  4820   128      1           129   122 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46  kernel:   [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46  kernel: agetty        S C02204F2  4436   129      1           130   128 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46  kernel:   [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46  kernel: agetty        S C02204F2  5064   130      1           131   129 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46  kernel:   [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46  kernel: agetty        S C02204F2  5064   131      1           132   130 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46  kernel:   [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46  kernel: agetty        S C02204F2  5064   132      1           133   131 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46  kernel:   [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46  kernel: agetty        S C02204F2  5064   133      1           134   132 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02204f2>] [<c0116c55>] [<c0221120>] [<c0220ec9>] [<c0214e42>]
Oct 22 12:49:46  kernel:   [<c020fce9>] [<c013ef6b>] [<c012bf93>] [<c010723b>]
Oct 22 12:49:46  kernel: ntpd          S 00000000  2404   134      1   135     280   133 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c55>] [<c02ab1fc>] [<c0150dcb>] [<c0150eb4>] [<c01510d1>]
Oct 22 12:49:46  kernel:   [<c010723b>]
Oct 22 12:49:46  kernel: ntpd          S 00000030  4956   135    134                     (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c013724c>] [<c0116c08>] [<c02ab1fc>] [<c0116b50>] [<c0150eb4>]
Oct 22 12:49:46  kernel:   [<c01510d1>] [<c010723b>]
Oct 22 12:49:46  kernel: xfssyncd      S 00000002   140   280      1           328   134 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c01f342c>] [<c0116c08>] [<c0116b50>] [<c0204283>] [<c0203731>]
Oct 22 12:49:46  kernel:   [<c010578e>] [<c0203690>]
Oct 22 12:49:46  kernel: drbd0_receive D 00000001  4416   328      1          4542   280 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0105d12>] [<c0105eac>] [<f896ad33>] [<f897a1f9>] [<f896fe7b>]
Oct 22 12:49:46  kernel:   [<f8969bad>] [<f897a1f9>] [<f89663ab>] [<f8969de8>] [<f897a7f0>] [<f896ff5a>]
Oct 22 12:49:46  kernel:   [<c010578e>] [<f896fee0>]
Oct 22 12:49:46  kernel: drbd0_worker  S 00000002  4572  4542      1          6151   328 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0311ca6>] [<c0105de9>] [<c0105eb7>] [<f8965174>] [<f897a46d>]
Oct 22 12:49:46  kernel:   [<f896ff5a>] [<c010578e>] [<f896fee0>]
Oct 22 12:49:46  kernel: drbdsetup     D 4000A490     0  6151      1                4542 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0105d12>] [<c0105eac>] [<f8973d4f>] [<f89706e2>] [<f896140c>]
Oct 22 12:49:46  kernel:   [<f8961e9e>] [<c0146e85>] [<c014f775>] [<c010723b>]
Oct 22 12:49:46  kernel: sshd          R 00000002     0 22511     84 22513   27498       (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c013efd7>] [<c01072e9>]
Oct 22 12:49:46  kernel: bash          R current      0 22513  22511                     (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c02219ab>] [<c0220c1d>] [<c0311ca6>] [<c011a865>] [<c011aadf>]
Oct 22 12:49:46  kernel:   [<c011aadf>] [<c01074f9>] [<c01074f9>] [<c01180ed>] [<c022a2cb>] [<c022a229>]
Oct 22 12:49:46  kernel:   [<c0166b0e>] [<c013f0db>] [<c010723b>]
Oct 22 12:49:46  kernel: sshd          S C036B8A0     0 27498     84 27500   27733 22511 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c013701b>] [<c0116c55>] [<c0216c2b>] [<c02152f6>] [<c0150729>]
Oct 22 12:49:46  kernel:   [<c0150bc9>] [<c010723b>]
Oct 22 12:49:46  kernel: bash          S 00000000  1300 27500  27498 27732               (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c011e0e8>] [<c021174a>] [<c010723b>]
Oct 22 12:49:46  kernel: pico          S C920FEB0     0 27732  27500                     (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c55>] [<c0216c69>] [<c0216c2b>] [<c02152f6>] [<c0214e42>]
Oct 22 12:49:46  kernel:   [<c0150dcb>] [<c020fce9>] [<c013ef6b>] [<c010723b>]
Oct 22 12:49:46  kernel: sshd          S C036B8A0     4 27733     84 27735         27498 (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c013701b>] [<c0116c55>] [<c0216c2b>] [<c02152f6>] [<c0150729>]
Oct 22 12:49:46  kernel:   [<c0150bc9>] [<c010723b>]
Oct 22 12:49:46  kernel: bash          S E1AEDEB0     0 27735  27733                     (NOTLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0116c55>] [<c0115f18>] [<c0214e42>] [<c020fce9>] [<c013ef6b>]
Oct 22 12:49:46  kernel:   [<c010723b>]
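
To pull the blocked tasks out of a dump this long, something like the following could help. It is only a rough sketch: the function name `tasks_in_state` and the regular expression are made up for illustration and assume the 2.4-style layout above (task name, state letter, PC, free stack, pid). Task names are truncated by the kernel in this dump, so `drbd0_receiver` shows up as `drbd0_receive`.

```python
import re

# Assumed layout: "<syslog prefix> kernel: <task> <state> <PC> <free stack> <pid> ..."
TASK_LINE = re.compile(r"kernel: (\S+)\s+([A-Z])\s+(\S+)\s+(\d+)\s+(\d+)\s")

def tasks_in_state(log_text, state="D"):
    """Return (task, pid) pairs for task lines whose state column equals `state`."""
    found = []
    for line in log_text.splitlines():
        match = TASK_LINE.search(line)
        # Call Trace and header lines do not match the pattern and are skipped.
        if match and match.group(2) == state:
            found.append((match.group(1), int(match.group(5))))
    return found

# A few lines copied from the dump above.
SAMPLE = """\
Oct 22 12:49:46  kernel: xfssyncd      S 00000002   140   280      1           328   134 (L-TLB)
Oct 22 12:49:46  kernel: drbd0_receive D 00000001  4416   328      1          4542   280 (L-TLB)
Oct 22 12:49:46  kernel: Call Trace:    [<c0105d12>] [<c0105eac>] [<f896ad33>] [<f897a1f9>] [<f896fe7b>]
Oct 22 12:49:46  kernel: drbd0_worker  S 00000002  4572  4542      1          6151   328 (L-TLB)
Oct 22 12:49:46  kernel: drbdsetup     D 4000A490     0  6151      1                4542 (NOTLB)
"""

blocked = tasks_in_state(SAMPLE)
print(blocked)
```

On the sample lines this picks out drbd0_receive (pid 328) and drbdsetup (pid 6151), the two tasks in uninterruptible sleep.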

> or at least give the output of /proc/drbd, and
> ps -eo pid,comm,stat,wchan ?

# cat /proc/drbd
version: 0.7.5 (api:76/proto:74)
SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22
  0: cs:WFReportParams st:Primary/Unknown ld:Consistent
     ns:8715928 nr:0 dw:36426099 dr:7245115 al:8923 bm:2370 lo:0 pe:0 ua:0 ap:0
  1: cs:Unconfigured
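
For watching for this state from a script, a minimal sketch along these lines might do. It assumes the 0.7-era /proc/drbd layout shown above; `parse_drbd_status` and the dictionary it returns are hypothetical names for illustration, not anything DRBD provides.

```python
import re

def parse_drbd_status(text):
    """Return {minor: {field: value}} from /proc/drbd text in the layout above."""
    devices = {}
    for line in text.splitlines():
        # Device lines start with the minor number, e.g. "  0: cs:WFReportParams ..."
        match = re.match(r"\s*(\d+):\s+(.*)", line)
        if not match:
            continue  # version, SVN revision and counter lines are ignored
        minor, rest = int(match.group(1)), match.group(2)
        # Remaining tokens are key:value pairs such as cs:WFReportParams.
        devices[minor] = dict(tok.split(":", 1) for tok in rest.split() if ":" in tok)
    return devices

# Text copied from the /proc/drbd output above.
SAMPLE = """\
version: 0.7.5 (api:76/proto:74)
SVN Revision: 1578 build by root at mxtelecom.com, 2004-10-10 18:54:22
  0: cs:WFReportParams st:Primary/Unknown ld:Consistent
     ns:8715928 nr:0 dw:36426099 dr:7245115 al:8923 bm:2370 lo:0 pe:0 ua:0 ap:0
  1: cs:Unconfigured
"""

status = parse_drbd_status(SAMPLE)
stuck = [minor for minor, fields in status.items() if fields.get("cs") == "WFReportParams"]
print(stuck)
```

On a live system the text would come from reading /proc/drbd instead of SAMPLE. A single reading of WFReportParams is not in itself a fault, so a real check would presumably want to see the state persist across several polls before alerting.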

# ps -eo pid,comm,stat,wchan
   PID COMMAND          STAT WCHAN
     1 init             S    select
     2 keventd          S    context_thread
     3 ksoftirqd_CPU0   SN   ksoftirqd
     4 kswapd           S    kswapd
     5 bdflush          S    bdflush
     6 kupdated         S    kupdate
     7 xfsbufd          S    pagebuf_daemon
     8 xfslogd/0        S    pagebuf_iodone_daemon
     9 xfsdatad/0       S    pagebuf_iodone_daemon
    10 scsi_eh_0        S    down_interruptible
    11 mdrecoveryd      S<   skb_copy_datagram_iovec
    58 syslogd          Ss   select
    61 klogd            Ss   syslog
    81 inetd            Ss   select
    84 sshd             Ss   select
    95 rpc.portmap      Ss   poll
    98 rpc.rquotad      Ss   poll
   100 nfsd             S    bitreverse
   101 lockd            S    bitreverse
   102 rpciod           S    vlan_proc_read
   103 nfsd             S    bitreverse
   104 nfsd             S    bitreverse
   105 nfsd             S    bitreverse
   106 nfsd             S    bitreverse
   107 nfsd             S    bitreverse
   108 nfsd             S    bitreverse
   109 nfsd             S    bitreverse
   111 rpc.mountd       Ss   select
   114 rpc.statd        Ss   select
   120 crond            S    nanosleep
   122 atd              Ss   nanosleep
   128 agetty           Ss+  read_chan
   129 agetty           Ss+  read_chan
   130 agetty           Ss+  read_chan
   131 agetty           Ss+  read_chan
   132 agetty           Ss+  read_chan
   133 agetty           Ss+  read_chan
   134 ntpd             Ss   poll
   135 ntpd             S    poll
   280 xfssyncd         S    xfssyncd
   328 drbd0_receiver   D    down
  4542 drbd0_worker     S    down_interruptible
  6151 drbdsetup        D    down
 22511 sshd             Ss   select
 22513 bash             Ss   wait4
 28315 ps               R+   -
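
The same ps listing can be filtered for uninterruptible (D state) processes with a short sketch like this. The function name `uninterruptible_tasks` is invented for illustration, and it assumes exactly the four columns requested above with no spaces inside the command name.

```python
def uninterruptible_tasks(ps_output):
    """Return (pid, command, wchan) for rows whose STAT column starts with 'D'."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 4:
            continue  # not a pid/comm/stat/wchan row
        pid, comm, stat, wchan = fields
        if stat.startswith("D"):
            rows.append((int(pid), comm, wchan))
    return rows

# A few rows copied from the listing above.
SAMPLE = """\
   PID COMMAND          STAT WCHAN
   280 xfssyncd         S    xfssyncd
   328 drbd0_receiver   D    down
  4542 drbd0_worker     S    down_interruptible
  6151 drbdsetup        D    down
 28315 ps               R+   -
"""

for pid, comm, wchan in uninterruptible_tasks(SAMPLE):
    print(pid, comm, wchan)
```

Here it reports drbd0_receiver and drbdsetup, both waiting in `down`.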

many thanks for looking into this :)

M.

-- 
______________________________________________________________
Matthew Hodgson   matthew at mxtelecom.com   Tel: +44 845 6667778
                 Systems Analyst, MX Telecom Ltd.



