/ 2006-09-22 16:16:27 +0100
\ Mark Olliver:
> Hi,
>
> I have just noticed my drbd cluster has a status of
>
> version: 0.7.19 (api:78/proto:74)
> SVN Revision: 2212 build by root at cp1.thermeoneurope.com, 2006-05-30 12:57:03
>  0: cs:NetworkFailure st:Primary/Secondary ld:Consistent
>     ns:384452 nr:4880 dw:137663608 dr:34555860 al:102939 bm:22058 lo:0 pe:0 ua:0 ap:0
>
> This was the same on both the primary and secondary (obviously just the
> other way around). As I still have live people working on the primary I
> tried taking down the secondary and restarting it. This seemed ok; now the
> secondary though is in the following state:
>
> [root at ie-openvz1 ~]# cat /proc/drbd
> version: 0.7.19 (api:78/proto:74)
> SVN Revision: 2212 build by root at cp-bk.thermeoneurope.com, 2006-05-30 15:22:21
>  0: cs:WFConnection st:Secondary/Unknown ld:Consistent
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
>
> Still not seeing the primary. I can ping between the two ok, there are no
> firewall rules blocking them and they have been working fine until now. I do
> not really want to take the primary down as there are users live on it.
>
> The two machines are separated: one is in Dublin, the other in London,
> connected over the internet via vpn.

you probably got bitten by some variant of what is referenced in the
changelog as

 0.7.21
 ...
  * Fixed the "stalled in WFParams" after reconnect symptom. The cause
    of this bug was actually a misuse of the data socket.

> Any ideas on how to get these two to talk again gratefully received.

NetworkFailure normally is a transient state, that should go right into
WFConnection (or, in some other cases, into StandAlone).
since it is stuck somewhere on the way, there is probably no good way to
recover short of a reboot (or maybe forced module unload, but I won't do
that).

make sure your secondary won't "accidentally" take over, since it has
consistent, but _out of date_ data. so stop heartbeat on the secondary.
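
to keep an eye on whether a device is sitting in a state like the ones quoted above, the cs: field can be pulled out of /proc/drbd. a minimal sketch, assuming the 0.7.x line layout shown in this thread ("0: cs:... st:... ld:..."); the function names are made up for illustration and are not part of DRBD.

```shell
#!/bin/sh
# Sketch only: reads /proc/drbd-style text on stdin.
# Assumes the 0.7.x layout quoted in this thread, where each device
# line looks like " 0: cs:<state> st:<role>/<role> ld:<disk state>".

# Print "<minor> <connection state>" for every device line.
drbd_cs_states() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\):[[:space:]]*cs:\([A-Za-z]*\).*/\1 \2/p'
}

# Return 0 only if at least one device is listed and all are Connected.
drbd_all_connected() {
    states=$(drbd_cs_states)
    [ -n "$states" ] || return 1
    echo "$states" | awk '$2 != "Connected" { bad = 1 } END { exit bad + 0 }'
}
```

usage would be something like "drbd_cs_states < /proc/drbd" on each node. this only reads the status file, it does not touch the drbd network configuration.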

caution.
if drbd is in the state (or a variation of it) that I suspect, and that
was fixed in 0.7.21, the reboot of the primary in this state of DRBD may
hang or panic. more specifically, in that state any attempt to change the
drbd network conf (drbdadm down, drbdadm disconnect, drbdadm connect)
could cause a kernel hang or panic.

so you may want to consider stopping services, umounting drbd, and then
doing something like
    echo 1 > /proc/sys/kernel/sysrq;
    echo s > /proc/sysrq-trigger
    echo u > /proc/sysrq-trigger
    echo b > /proc/sysrq-trigger

that said, my suspicion may be wrong, and there may be another way out
of it. grep for drbd related messages in the syslogs or dmesg, and
provide output of
    ps -eo pid,stat,wchan:40,comm

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe http://www.linbit.com :
__
please use the "List-Reply" function of your email client.