Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 8 May 2006, Lars Ellenberg wrote: > / 2006-05-08 15:18:31 +0300 > \ Cyril Bouthors: >> Another interesting thing on the primary: >> >> # ps -axo pid,wchan=WIDE-WCHAN-COLUMN,cmd -o comm | grep drbd >> 2493 down [drbd0_receiver] drbd0_receiver >> 5514 ? drbdsetup /dev/drbd0 net 10 drbdsetup >> 5661 down_interruptibl drbdsetup /dev/drbd0 net 10 drbdsetup >> 6203 down_interruptibl drbdsetup /dev/drbd0 state drbdsetup >> >> "drbdsetup /dev/drbd0 state" seems to be running for quite some time, >> it has been launched by heartbeat. > > only one drbdsetup per /dev/drbdX is allowed at any time. > thats why the other two are in "down_interruptible" state, they wait for > the ioctl semaphore. > > I'm sorry, this is not debuggable this way. I cannot even tell where > drbd does hang, if it hangs. Maybe it hangs in some lower level kernel > function, that is not supposed to hang. Maybe it is something in our > kernel 2.4 compatibility layer. There is not enought information for me > to give any specific diagnosis, and unless you are able to get some > "stack traces" (via kernel debugger/sysrq/whatever) to see where exactly > which kernel threads are waiting on each other (which I assume is the > case), I can not help you here. > > The best I can do is recommend to use a 2.6 kernel, > and see if it gets better. I've upgraded to Linux 2.6.16-2-686 (Debian/sid flavor) and DRBD 0.7.19 but the same thing still happens on primary after a network failure : Jul 22 21:47:21 sqlb1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 1056 K/sec) Jul 22 21:47:21 sqlb1 kernel: drbd0: drbd0_worker [8975]: cstate SyncSource --> Connected Jul 22 22:49:15 sqlb1 kernel: drbd0: meta connection shut down by peer. Jul 22 22:49:16 sqlb1 kernel: drbd0: drbd0_asender [8976]: cstate Connected --> NetworkFailure Jul 22 22:49:16 sqlb1 kernel: drbd0: asender terminated Jul 22 22:49:16 sqlb1 kernel: drbd0: sock was shut down by peer Jul 22 22:49:16 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate NetworkFailure --> BrokenPipe Jul 22 22:49:16 sqlb1 kernel: drbd0: short read expecting header on sock: r=0 Jul 22 22:49:16 sqlb1 kernel: drbd0: worker terminated Jul 22 22:49:16 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate BrokenPipe --> Unconnected Jul 22 22:49:16 sqlb1 kernel: drbd0: Connection lost. Jul 22 22:49:16 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate Unconnected --> WFConnection Jul 22 22:49:17 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFConnection --> WFReportParams Jul 22 22:49:17 sqlb1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Jul 22 22:49:17 sqlb1 kernel: drbd0: Connection established. Jul 22 22:49:17 sqlb1 kernel: drbd0: I am(P): 1:00000003:00000006:000000cd:00000008:10 Jul 22 22:49:17 sqlb1 kernel: drbd0: Peer(S): 1:00000003:00000006:000000cc:00000008:01 Jul 22 22:49:17 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFReportParams --> WFBitMapS Jul 22 22:49:18 sqlb1 kernel: drbd0: Primary/Unknown --> Primary/Secondary Jul 22 22:49:19 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFBitMapS --> SyncSource Jul 22 22:49:19 sqlb1 kernel: drbd0: Resync started as SyncSource (need to sync 792 KB [198 bits set]). Jul 22 22:49:21 sqlb1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 792 K/sec) Jul 22 22:49:21 sqlb1 kernel: drbd0: drbd0_worker [14010]: cstate SyncSource --> Connected Jul 23 00:40:15 sqlb1 kernel: drbd0: meta connection shut down by peer. Jul 23 00:40:15 sqlb1 kernel: drbd0: drbd0_asender [14011]: cstate Connected --> NetworkFailure Jul 23 00:40:15 sqlb1 kernel: drbd0: asender terminated Jul 23 00:40:15 sqlb1 kernel: drbd0: sock was shut down by peer Jul 23 00:40:15 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate NetworkFailure --> BrokenPipe Jul 23 00:40:15 sqlb1 kernel: drbd0: short read expecting header on sock: r=0 Jul 23 00:40:15 sqlb1 kernel: drbd0: worker terminated Jul 23 00:40:15 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate BrokenPipe --> Unconnected Jul 23 00:40:15 sqlb1 kernel: drbd0: Connection lost. Jul 23 00:40:15 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate Unconnected --> WFConnection Jul 23 00:40:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFConnection --> WFReportParams Jul 23 00:40:18 sqlb1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Jul 23 00:40:18 sqlb1 kernel: drbd0: Connection established. Jul 23 00:40:18 sqlb1 kernel: drbd0: I am(P): 1:00000003:00000006:000000ce:00000008:10 Jul 23 00:40:18 sqlb1 kernel: drbd0: Peer(S): 1:00000003:00000006:000000cd:00000008:01 Jul 23 00:40:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFReportParams --> WFBitMapS Jul 23 00:40:18 sqlb1 kernel: drbd0: Primary/Unknown --> Primary/Secondary Jul 23 00:40:19 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFBitMapS --> SyncSource Jul 23 00:40:19 sqlb1 kernel: drbd0: Resync started as SyncSource (need to sync 620 KB [155 bits set]). Jul 23 00:40:29 sqlb1 kernel: drbd0: Resync done (total 10 sec; paused 0 sec; 60 K/sec) Jul 23 00:40:29 sqlb1 kernel: drbd0: drbd0_worker [23710]: cstate SyncSource --> Connected Jul 23 00:51:18 sqlb1 kernel: drbd0: meta connection shut down by peer. Jul 23 00:51:18 sqlb1 kernel: drbd0: drbd0_asender [23727]: cstate Connected --> NetworkFailure Jul 23 00:51:18 sqlb1 kernel: drbd0: asender terminated Jul 23 00:51:18 sqlb1 kernel: drbd0: short sent Barrier size=16 sent=-1001 Jul 23 00:51:18 sqlb1 kernel: drbd0: sock was shut down by peer Jul 23 00:51:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate NetworkFailure --> BrokenPipe Jul 23 00:51:18 sqlb1 kernel: drbd0: short read expecting header on sock: r=0 Jul 23 00:51:18 sqlb1 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001 Jul 23 00:51:18 sqlb1 kernel: drbd0: worker terminated Jul 23 00:51:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate BrokenPipe --> Unconnected Jul 23 00:51:18 sqlb1 kernel: drbd0: Connection lost. Jul 23 00:51:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate Unconnected --> WFConnection Jul 23 00:51:26 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFConnection --> WFReportParams And then, nothing in the logs for 12+ hours and DRBD is still stuck in 'WFReportParams': sqlb1:~# cat /proc/drbd version: 0.7.19 (api:78/proto:74) SVN Revision: 2212 build by root at sqlb1, 2006-07-05 15:00:29 0: cs:WFReportParams st:Primary/Unknown ld:Consistent ns:3875568 nr:0 dw:150754376 dr:473239985 al:3637843 bm:238956 lo:0 pe:0 ua:0 ap:0 sqlb1:~# This time, the reaction is better than when I was running 2.4 because the DRBD partition is still readable and writeable. I have no abnormal DRBD process running: sqlb1:~# ps auxwf | grep -i drbd root 3479 0.0 0.0 0 0 ? S Jul05 4:23 [drbd0_receiver] root 24915 0.0 0.0 0 0 ? S Jul23 0:06 [drbd0_worker] sqlb1:~# The machine has been running 3 weeks without any problem before this event. > I'm sorry, this is not debuggable this way. I cannot even tell where > drbd does hang, if it hangs. Maybe it hangs in some lower level > kernel function, that is not supposed to hang. Maybe it is something > in our kernel 2.4 compatibility layer. There is not enought > information for me to give any specific diagnosis, and unless you > are able to get some "stack traces" (via kernel > debugger/sysrq/whatever) to see where exactly which kernel threads > are waiting on each other (which I assume is the case), I can not > help you here. Lars, I have excellent news for you, this time I kept the machine in the WFReportParams state in order for us to get rid of this bug forever. I have no kernel debugger/sysrq/whatever experience, would you please tell me what to do? Or maybe the easiest thing for you would be to connect directly to the machine? I trust you, I can give you SSH root access if you want it. What do you think? Please let me know. Thanks -- Cyril Bouthors -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 188 bytes Desc: not available URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20060724/eb451669/attachment.pgp>