[DRBD-user] Re: Network Failure

Sat Sep 23 12:48:42 CEST 2006

/ 2006-09-22 16:16:27 +0100
\ Mark Olliver:
> Hi,
> 
>  
> 
> I have just noticed my drbd cluster has a status of 
> 
> version: 0.7.19 (api:78/proto:74)
> 
> SVN Revision: 2212 build by root@cp1.thermeoneurope.com, 2006-05-30 12:57:03
> 
>  0: cs:NetworkFailure st:Primary/Secondary ld:Consistent
> 
>     ns:384452 nr:4880 dw:137663608 dr:34555860 al:102939 bm:22058 lo:0 pe:0 ua:0 ap:0
> 
>  
> 
> This was the same on both the primary and secondary (obviously just the
> other way around). As I still have live people working on the primary, I
> tried taking down the secondary and restarting it. This seemed ok, but the
> secondary is now in the following state:
> 
> [root@ie-openvz1 ~]# cat /proc/drbd
> 
> version: 0.7.19 (api:78/proto:74)
> 
> SVN Revision: 2212 build by root@cp-bk.thermeoneurope.com, 2006-05-30 15:22:21
> 
>  0: cs:WFConnection st:Secondary/Unknown ld:Consistent
> 
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
> 
>  
> 
> It is still not seeing the primary. I can ping between the two ok, there are
> no firewall rules blocking them, and they have been working fine until now.
> I do not really want to take the primary down as there are users live on it.
> 
>  
> 
> The two machines are in separate locations: one is in Dublin, the other in
> London, connected over the internet via VPN.

you probably got bitten by some variant of what is referenced in the
changelog for 0.7.21 as
 * Fixed the "stalled in WFParams" after reconnect symptom. The cause
   of this bug was actually a misuse of the data socket.
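
a quick way to see whether a node still runs a release from before that
fix is to compare the version line of /proc/drbd against 0.7.21. a minimal
sketch, assuming awk is available; the function names drbd_version and
version_lt are made up for this example and are not part of drbd:

```shell
# print the DRBD version from a /proc/drbd style file,
# e.g. "0.7.19" from "version: 0.7.19 (api:78/proto:74)"
drbd_version() {
    awk '$1 == "version:" { print $2; exit }' "${1:-/proc/drbd}"
}

# succeed (exit status 0) if dotted version $1 is strictly older than $2;
# components are compared numerically, left to right
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "[.]"); m = split(b, y, "[.]")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            xi = x[i] + 0; yi = y[i] + 0   # missing components count as 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1                             # equal versions are not "older"
    }'
}
```

used as: version_lt "$(drbd_version)" 0.7.21 && echo "older than 0.7.21"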

> Any ideas on how to get these two to talk again gratefully received.

NetworkFailure normally is a transient state that should go right into
WFConnection (or, in some other cases, into StandAlone).
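
to tell transient from stuck, one can sample the cs: field of /proc/drbd a
few times. a rough sketch; drbd_cstate is a hypothetical helper name, and
the field layout is assumed to match the 0.7 output quoted above:

```shell
# print the connection state (the cs: field) of one DRBD minor,
# read from a /proc/drbd style status file.
# usage: drbd_cstate [minor] [status_file]
drbd_cstate() {
    minor="${1:-0}"
    file="${2:-/proc/drbd}"
    # device lines look like:
    #  0: cs:WFConnection st:Secondary/Unknown ld:Consistent
    awk -v dev="${minor}:" '
        $1 == dev {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cs:/) { print substr($i, 4); exit }
        }' "$file"
}
```

if several samples in a row (say, drbd_cstate 0; sleep 5; drbd_cstate 0)
all print NetworkFailure, the state is not transient any more.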

since it is stuck somewhere on the way, there is probably no good way to
recover short of a reboot (or maybe a forced module unload, but I wouldn't
do that). make sure your secondary won't "accidentally" take over, since it
has consistent, but _out of date_, data. so stop heartbeat on the secondary.

caution: if drbd is in the state (or a variation of it) that I suspect,
and that was fixed in 0.7.21, the reboot of the primary in this state of
DRBD may hang or panic. more specifically, in that state any attempt to
change the drbd network conf (drbdadm down, drbdadm disconnect, drbdadm
connect) could cause a kernel hang or panic.

so you may want to consider stopping services, unmounting drbd,
and then doing something like
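 echo 1 > /proc/sys/kernel/sysrq   # enable the magic SysRq interface
 echo s > /proc/sysrq-trigger      # sync all filesystems
 echo u > /proc/sysrq-trigger      # remount all filesystems read-only
 echo b > /proc/sysrq-trigger      # reboot immediately, without a clean shutdown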
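
before firing the sysrq sequence it is worth double-checking that nothing
is still mounted from a drbd device. a small sketch, assuming the devices
are named /dev/drbd0, /dev/drbd1, ... (adjust the pattern if yours differ);
drbd_mounts is a hypothetical helper:

```shell
# list the mount points whose device is a DRBD device.
# reads /proc/mounts by default; fields there are: device mountpoint fstype ...
drbd_mounts() {
    awk '$1 ~ /^\/dev\/drbd/ { print $2 }' "${1:-/proc/mounts}"
}
```

used as: [ -z "$(drbd_mounts)" ] && echo "no drbd device mounted"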

that said, my suspicion may be wrong,
and there may be another way out of it.

grep for drbd related messages in the syslogs or dmesg,
and provide the output of
 ps -eo pid,stat,wchan:40,comm
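
the requests above could be collected in one go roughly like this. a sketch
only: drbd_diag is a made-up name, and /var/log/messages as the default
syslog file is an assumption that depends on the distribution:

```shell
# gather drbd related log lines and the process list into one report on stdout.
# arguments: syslog files to search (default /var/log/messages is an assumption)
drbd_diag() {
    [ "$#" -gt 0 ] || set -- /var/log/messages
    echo "== kernel ring buffer =="
    dmesg 2>/dev/null | grep -i drbd || true
    for log in "$@"; do
        echo "== $log =="
        grep -i drbd "$log" 2>/dev/null || true
    done
    # wchan shows the kernel function each process is currently sleeping in
    echo "== processes =="
    ps -eo pid,stat,wchan:40,comm || echo "(ps did not accept these options)"
    return 0
}
```

used as: drbd_diag /var/log/messages > drbd-report.txt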

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :