[DRBD-user] DRBD stuck after a strong network failure

Cyril Bouthors cyril at bouthors.org
Mon Jul 24 11:39:51 CEST 2006

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On  8 May 2006, Lars Ellenberg wrote:

> / 2006-05-08 15:18:31 +0300
> \ Cyril Bouthors:
>> Another interesting thing on the primary:
>>
>> # ps -axo pid,wchan=WIDE-WCHAN-COLUMN,cmd -o comm | grep drbd
>> 2493 down              [drbd0_receiver]            drbd0_receiver
>> 5514 ?                 drbdsetup /dev/drbd0 net 10 drbdsetup
>> 5661 down_interruptibl drbdsetup /dev/drbd0 net 10 drbdsetup
>> 6203 down_interruptibl drbdsetup /dev/drbd0 state drbdsetup
>>
>> "drbdsetup /dev/drbd0 state" seems to be running for quite some time,
>> it has been launched by heartbeat.
>
> only one drbdsetup per /dev/drbdX is allowed at any time.
> thats why the other two are in "down_interruptible" state, they wait for
> the ioctl semaphore.
>
> I'm sorry, this is not debuggable this way. I cannot even tell where
> drbd does hang, if it hangs. Maybe it hangs in some lower level kernel
> function, that is not supposed to hang. Maybe it is something in our
> kernel 2.4 compatibility layer.  There is not enought information for me
> to give any specific diagnosis, and unless you are able to get some
> "stack traces" (via kernel debugger/sysrq/whatever) to see where exactly
> which kernel threads are waiting on each other (which I assume is the
> case), I can not help you here.
>
> The best I can do is recommend to use a 2.6 kernel,
> and see if it gets better.

I've upgraded to Linux 2.6.16-2-686 (Debian/sid flavor) and DRBD
0.7.19 but the same thing still happens on primary after a network
failure :

Jul 22 21:47:21 sqlb1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 1056 K/sec)
Jul 22 21:47:21 sqlb1 kernel: drbd0: drbd0_worker [8975]: cstate SyncSource --> Connected
Jul 22 22:49:15 sqlb1 kernel: drbd0: meta connection shut down by peer.
Jul 22 22:49:16 sqlb1 kernel: drbd0: drbd0_asender [8976]: cstate Connected --> NetworkFailure
Jul 22 22:49:16 sqlb1 kernel: drbd0: asender terminated
Jul 22 22:49:16 sqlb1 kernel: drbd0: sock was shut down by peer
Jul 22 22:49:16 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate NetworkFailure --> BrokenPipe
Jul 22 22:49:16 sqlb1 kernel: drbd0: short read expecting header on sock: r=0
Jul 22 22:49:16 sqlb1 kernel: drbd0: worker terminated
Jul 22 22:49:16 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate BrokenPipe --> Unconnected
Jul 22 22:49:16 sqlb1 kernel: drbd0: Connection lost.
Jul 22 22:49:16 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate Unconnected --> WFConnection
Jul 22 22:49:17 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFConnection --> WFReportParams
Jul 22 22:49:17 sqlb1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jul 22 22:49:17 sqlb1 kernel: drbd0: Connection established.
Jul 22 22:49:17 sqlb1 kernel: drbd0: I am(P): 1:00000003:00000006:000000cd:00000008:10
Jul 22 22:49:17 sqlb1 kernel: drbd0: Peer(S): 1:00000003:00000006:000000cc:00000008:01
Jul 22 22:49:17 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFReportParams --> WFBitMapS
Jul 22 22:49:18 sqlb1 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Jul 22 22:49:19 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFBitMapS --> SyncSource
Jul 22 22:49:19 sqlb1 kernel: drbd0: Resync started as SyncSource (need to sync 792 KB [198 bits set]).
Jul 22 22:49:21 sqlb1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 792 K/sec)
Jul 22 22:49:21 sqlb1 kernel: drbd0: drbd0_worker [14010]: cstate SyncSource --> Connected
Jul 23 00:40:15 sqlb1 kernel: drbd0: meta connection shut down by peer.
Jul 23 00:40:15 sqlb1 kernel: drbd0: drbd0_asender [14011]: cstate Connected --> NetworkFailure
Jul 23 00:40:15 sqlb1 kernel: drbd0: asender terminated
Jul 23 00:40:15 sqlb1 kernel: drbd0: sock was shut down by peer
Jul 23 00:40:15 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate NetworkFailure --> BrokenPipe
Jul 23 00:40:15 sqlb1 kernel: drbd0: short read expecting header on sock: r=0
Jul 23 00:40:15 sqlb1 kernel: drbd0: worker terminated
Jul 23 00:40:15 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate BrokenPipe --> Unconnected
Jul 23 00:40:15 sqlb1 kernel: drbd0: Connection lost.
Jul 23 00:40:15 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate Unconnected --> WFConnection
Jul 23 00:40:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFConnection --> WFReportParams
Jul 23 00:40:18 sqlb1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jul 23 00:40:18 sqlb1 kernel: drbd0: Connection established.
Jul 23 00:40:18 sqlb1 kernel: drbd0: I am(P): 1:00000003:00000006:000000ce:00000008:10
Jul 23 00:40:18 sqlb1 kernel: drbd0: Peer(S): 1:00000003:00000006:000000cd:00000008:01
Jul 23 00:40:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFReportParams --> WFBitMapS
Jul 23 00:40:18 sqlb1 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Jul 23 00:40:19 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFBitMapS --> SyncSource
Jul 23 00:40:19 sqlb1 kernel: drbd0: Resync started as SyncSource (need to sync 620 KB [155 bits set]).
Jul 23 00:40:29 sqlb1 kernel: drbd0: Resync done (total 10 sec; paused 0 sec; 60 K/sec)
Jul 23 00:40:29 sqlb1 kernel: drbd0: drbd0_worker [23710]: cstate SyncSource --> Connected
Jul 23 00:51:18 sqlb1 kernel: drbd0: meta connection shut down by peer.
Jul 23 00:51:18 sqlb1 kernel: drbd0: drbd0_asender [23727]: cstate Connected --> NetworkFailure
Jul 23 00:51:18 sqlb1 kernel: drbd0: asender terminated
Jul 23 00:51:18 sqlb1 kernel: drbd0: short sent Barrier size=16 sent=-1001
Jul 23 00:51:18 sqlb1 kernel: drbd0: sock was shut down by peer
Jul 23 00:51:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate NetworkFailure --> BrokenPipe
Jul 23 00:51:18 sqlb1 kernel: drbd0: short read expecting header on sock: r=0
Jul 23 00:51:18 sqlb1 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001
Jul 23 00:51:18 sqlb1 kernel: drbd0: worker terminated
Jul 23 00:51:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate BrokenPipe --> Unconnected
Jul 23 00:51:18 sqlb1 kernel: drbd0: Connection lost.
Jul 23 00:51:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate Unconnected --> WFConnection
Jul 23 00:51:26 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFConnection --> WFReportParams

And then, nothing in the logs for 12+ hours and DRBD is still stuck in
'WFReportParams':

sqlb1:~# cat /proc/drbd
version: 0.7.19 (api:78/proto:74)
SVN Revision: 2212 build by root at sqlb1, 2006-07-05 15:00:29
 0: cs:WFReportParams st:Primary/Unknown ld:Consistent
    ns:3875568 nr:0 dw:150754376 dr:473239985 al:3637843 bm:238956 lo:0 pe:0 ua:0 ap:0
sqlb1:~#


This time, the reaction is better than when I was running 2.4 because
the DRBD partition is still readable and writeable.

I have no abnormal DRBD process running:

sqlb1:~# ps auxwf | grep -i drbd
root      3479  0.0  0.0      0     0 ?        S    Jul05   4:23 [drbd0_receiver]
root     24915  0.0  0.0      0     0 ?        S    Jul23   0:06 [drbd0_worker]
sqlb1:~#

The machine has been running 3 weeks without any problem before this
event.

> I'm sorry, this is not debuggable this way. I cannot even tell where
> drbd does hang, if it hangs. Maybe it hangs in some lower level
> kernel function, that is not supposed to hang. Maybe it is something
> in our kernel 2.4 compatibility layer. There is not enought
> information for me to give any specific diagnosis, and unless you
> are able to get some "stack traces" (via kernel
> debugger/sysrq/whatever) to see where exactly which kernel threads
> are waiting on each other (which I assume is the case), I can not
> help you here.

Lars, I have excellent news for you, this time I kept the machine in
the WFReportParams state in order for us to get rid of this bug
forever.

I have no kernel debugger/sysrq/whatever experience, would you please
tell me what to do? Or maybe the easiest thing for you would be to
connect directly to the machine? I trust you, I can give you SSH root
access if you want it. What do you think? Please let me know.

Thanks
-- 
Cyril Bouthors
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 188 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20060724/eb451669/attachment.pgp>


More information about the drbd-user mailing list