[DRBD-user] DRBD stuck after a strong network failure

Lars Ellenberg Lars.Ellenberg at linbit.com
Tue Jul 25 03:32:08 CEST 2006



/ 2006-07-24 12:39:51 +0300
\ Cyril Bouthors:
> On  8 May 2006, Lars Ellenberg wrote:
> > The best I can do is recommend to use a 2.6 kernel,
> > and see if it gets better.
> 
> I've upgraded to Linux 2.6.16-2-686 (Debian/sid flavor) and DRBD
> 0.7.19, but the same thing still happens on the primary after a network
> failure:
> 
...
> Jul 23 00:51:18 sqlb1 kernel: drbd0: Connection lost.
> Jul 23 00:51:18 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate Unconnected --> WFConnection
> Jul 23 00:51:26 sqlb1 kernel: drbd0: drbd0_receiver [3479]: cstate WFConnection --> WFReportParams
> 
> And then, nothing in the logs for 12+ hours and DRBD is still stuck in
> 'WFReportParams':
> 
> sqlb1:~# cat /proc/drbd
> version: 0.7.19 (api:78/proto:74)
> SVN Revision: 2212 build by root at sqlb1, 2006-07-05 15:00:29
>  0: cs:WFReportParams st:Primary/Unknown ld:Consistent
>     ns:3875568 nr:0 dw:150754376 dr:473239985 al:3637843 bm:238956 lo:0 pe:0 ua:0 ap:0
> sqlb1:~#
> 
> 
> This time, the reaction is better than when I was running 2.4 because
> the DRBD partition is still readable and writeable.

> Lars, I have excellent news for you, this time I kept the machine in
> the WFReportParams state in order for us to get rid of this bug
> forever.
> 
> I have no kernel debugger/sysrq/whatever experience, would you please
> tell me what to do? Or maybe the easiest thing for you would be to
> connect directly to the machine? I trust you, I can give you SSH root
> access if you want it. What do you think? Please let me know.

Thanks, but there is probably no need for this.
We did some test runs with an aggressively preemptible SMP kernel,
to increase the likelihood that our test environment hits
bugs triggered by race conditions.
I probably fixed this "stuck in WFReportParams" already in svn
(only last week). You can check it in the svn log archived on
the (historically named) drbd-cvs list, or with the svn log command.

Unfortunately you have to reboot the box: if it is the same bug
I fixed last week, the kernel will oops (NULL pointer dereference in
force_sig) when you try to reconnect or take down DRBD.

Caution: svn will see some more updates before yet another bugfix
release (0.7.21), probably early next week.
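
For anyone who wants to notice this condition without watching the box
by hand, here is a minimal sketch that pulls the connection state out of
/proc/drbd-style text. It assumes the 0.7-style line layout shown in the
report above ("<minor>: cs:<state> st:<roles> ..."); the function names
parse_drbd_status and devices_in_state are made up for illustration and
are not part of DRBD or its tools.

```python
import re

# Assumption: device lines look like the 0.7.x output quoted above, e.g.
# " 0: cs:WFReportParams st:Primary/Unknown ld:Consistent"
DEVICE_LINE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+st:(\S+)")


def parse_drbd_status(text):
    """Return {minor: (connection_state, roles)} from /proc/drbd-style text."""
    devices = {}
    for line in text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            devices[int(match.group(1))] = (match.group(2), match.group(3))
    return devices


def devices_in_state(text, cstate="WFReportParams"):
    """Return the sorted minor numbers whose connection state equals cstate."""
    status = parse_drbd_status(text)
    return sorted(minor for minor, (cs, _roles) in status.items() if cs == cstate)


# Sample taken from the /proc/drbd output quoted in this thread.
SAMPLE = """version: 0.7.19 (api:78/proto:74)
SVN Revision: 2212 build by root at sqlb1, 2006-07-05 15:00:29
 0: cs:WFReportParams st:Primary/Unknown ld:Consistent
    ns:3875568 nr:0 dw:150754376 dr:473239985 al:3637843 bm:238956 lo:0 pe:0 ua:0 ap:0
"""

if __name__ == "__main__":
    print(devices_in_state(SAMPLE))
```

On a live system one would pass in the contents of /proc/drbd instead of
SAMPLE and sample it twice, some minutes apart, since a device is only
"stuck" if it stays in WFReportParams across both readings.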

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
