WFReportParams stuck? 0.7.10 (was Re: [DRBD-user] problem with drbd reconnection)

Jonathan Soong jon.soong at imvs.sa.gov.au
Wed Feb 8 01:03:02 CET 2006



Hi guys

I currently have the exact same problem on a remote site.

DRBD 0.7.10 on Fedora Core 3 (FC3)

The Primary reports:
1: cs:WFReportParams st:Primary/Unknown ld:Consistent
    ns:7473872 nr:42155372 dw:56393648 dr:158987165 al:28632 bm:21600 lo:0 pe:0 ua:0 ap:0

The Secondary reports:
1: cs:WFConnection st:Secondary/Unknown ld:Consistent
    ns:0 nr:0 dw:9480028 dr:53758477 al:1775 bm:3299 lo:0 pe:0 ua:0 ap:0
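
For anyone who wants to check for this state from a script, here is a minimal Python sketch that parses one device entry in the /proc/drbd format shown above and flags the pairing I am seeing (WFReportParams on one node, WFConnection on the other). The function names parse_drbd_status and looks_stuck are my own, not part of DRBD, and the sample strings are simply the two entries pasted above.

```python
import re

def parse_drbd_status(entry):
    """Parse one 0.7-style /proc/drbd device entry into a dict.

    Hypothetical helper, not part of DRBD. Numeric counters become ints,
    everything else (cs, st, ld) stays a string.
    """
    fields = {}
    minor = re.match(r"\s*(\d+):\s", entry)
    if minor:
        fields["minor"] = int(minor.group(1))
    # Every field has the form key:value with no space after the colon.
    for key, value in re.findall(r"([a-z]+):(\S+)", entry):
        fields[key] = int(value) if value.isdigit() else value
    return fields

def looks_stuck(local, peer):
    """True for the symptom in this thread: one side waits for the
    parameter report while the other still waits for a connection."""
    return (local.get("cs") == "WFReportParams"
            and peer.get("cs") == "WFConnection")

PRIMARY = """1: cs:WFReportParams st:Primary/Unknown ld:Consistent
    ns:7473872 nr:42155372 dw:56393648 dr:158987165 al:28632 bm:21600 lo:0 pe:0 ua:0 ap:0"""

SECONDARY = """1: cs:WFConnection st:Secondary/Unknown ld:Consistent
    ns:0 nr:0 dw:9480028 dr:53758477 al:1775 bm:3299 lo:0 pe:0 ua:0 ap:0"""

if __name__ == "__main__":
    primary = parse_drbd_status(PRIMARY)
    secondary = parse_drbd_status(SECONDARY)
    print(primary["cs"], secondary["cs"], looks_stuck(primary, secondary))
```

This only reads text; it does not touch the devices, so it should be safe to run on a production node.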

I have tried to reconnect on the Primary:
$> drbdadm connect <resource name>
"Child process does not terminate!
Exiting."

In my process list I now see:
root      4307     1  0  2005 ?        00:01:37 [drbd0_worker]
root      4338     1  0  2005 ?        00:05:04 [drbd0_receiver]
root      4346     1  0  2005 ?        00:09:47 [drbd1_receiver]
root      4354     1  0  2005 ?        00:06:28 [drbd2_receiver]
root      4378     1  0  2005 ?        00:03:25 [drbd0_asender]
root     29279     1  0 Feb02 ?        00:00:02 [drbd2_worker]
root     29280     1  0 Feb02 ?        00:00:26 [drbd2_asender]
root     29314     1  0 10:17 pts/1    00:00:00 /sbin/drbdsetup /dev/drbd1 net 192.168.0.1:7789 192.168.0.2:7789 C
root     29481     1  0 10:29 pts/0    00:00:00 /sbin/drbdsetup /dev/drbd1 net 192.168.0.1:7789 192.168.0.2:7789 C
(I tried the reconnect twice)

You can see in the above that:
- There are 2 'worker' processes (drbd0 and drbd2)
- There are 3 'receiver' processes (drbd0, drbd1 and drbd2)
- There are 2 'asender' processes (drbd0 and drbd2)
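
The tally above can be reproduced with a short Python sketch that groups the drbd kernel threads by device and reports which roles are absent. It assumes, based only on what drbd0 and drbd2 show in this listing, that a device in good shape has all three threads; the function names are my own and PS_OUTPUT is just the thread lines pasted above.

```python
import re
from collections import defaultdict

# Thread lines copied from the process list above.
PS_OUTPUT = """\
root      4307     1  0  2005 ?        00:01:37 [drbd0_worker]
root      4338     1  0  2005 ?        00:05:04 [drbd0_receiver]
root      4346     1  0  2005 ?        00:09:47 [drbd1_receiver]
root      4354     1  0  2005 ?        00:06:28 [drbd2_receiver]
root      4378     1  0  2005 ?        00:03:25 [drbd0_asender]
root     29279     1  0 Feb02 ?        00:00:02 [drbd2_worker]
root     29280     1  0 Feb02 ?        00:00:26 [drbd2_asender]
"""

# Assumption: these are the three roles seen on the healthy devices here.
EXPECTED_ROLES = {"worker", "receiver", "asender"}

def threads_by_device(ps_text):
    """Map device minor number -> set of thread roles found in ps output."""
    roles = defaultdict(set)
    pattern = r"\[drbd(\d+)_(worker|receiver|asender)\]"
    for device, role in re.findall(pattern, ps_text):
        roles[int(device)].add(role)
    return dict(roles)

def missing_roles(ps_text):
    """Return only the devices that lack one or more expected threads."""
    result = {}
    for device, roles in threads_by_device(ps_text).items():
        absent = EXPECTED_ROLES - roles
        if absent:
            result[device] = sorted(absent)
    return result

if __name__ == "__main__":
    print(missing_roles(PS_OUTPUT))
```

On the listing above this singles out drbd1, which has a receiver but no worker or asender.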

I presume that the extra 'receiver' process for drbd1 (which is my
resource) is the thing that is not dying or is locked.
Is there any way to force this to stop so I can connect the resource?
Is there any way to get these machines back in sync without rebooting?

These machines are in production and kind of difficult to get to.

Thanks for any help.

Cheers

Jon



nick wrote:

> Christoph Mitasch wrote:
>
>> Hi Nick!
>>
>> Have you tried
>> drbdadm connect all
>> on the Primary?
>>
>> Christoph
>>
>> On Wed, 2006-01-11 at 10:47 +0100, nick wrote:
>>
>>> I have two identical systems with drbd 0.7.10 (api:77) on two 
>>> systems running CentOS 4 kernel 2.6.9-5.0.3, they have one resource 
>>> in common hadisk, which is a software raid partition used for mail 
>>> storage.
>>>
>>> Last night, something went funky on the secondary, and it had to be 
>>> restarted (completely ran out of memory, it's not the first time it 
>>> has happened), and when it came up, instead of connecting like it
>>> normally does, it just sat there waiting for a connection
>>>
>>> a cat /proc/drbd on the primary gives me this:
>>>
>>>  0: cs:WFReportParams st:Primary/Unknown ld:Consistent
>>>     ns:75710352 nr:0 dw:80996564 dr:25603549 al:156060 bm:15221 lo:0 pe:0 ua:0 ap:0
>>>
>>> on the secondary, I get this:
>>>
>>> 0: cs:WFConnection st:Secondary/Unknown ld:Consistent
>>>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
>>>
>>>
>>>
>>> as it says, waiting for connection, the same message it gives me if 
>>> I run drbdadm cstate all.
>>>
>>> However, if I run drbdadm (anything) on the primary, it gives me this
>>> message:
>>>
>>> Child process does not terminate!
>>> Exiting.
>>>
>>> Which is not good.
>>>
>>> The problem is, the primary is still up, and accepting data, so I 
>>> really don't want to do anything rash, as I can't afford to lose 
>>> mail (the secondary is out of date, by how much I'm not sure). What 
>>> options do I have?
>>>
>>> I'll gladly give you any information you need, please help me get 
>>> these two back and talking, possibly with as little data loss as 
>>> possible.
>>>
>>>
>>> Nick
>>>
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>
>>
>>
> Yes, I have:
>
> Child process does not terminate!
> Exiting.
>
>
> if I do this:
>
>  ps ax | grep drbd
> 21789 pts/0    D      0:00 /sbin/drbdsetup /dev/drbd0 net 10.0.0.1:7788 10.0.0.2:7788 C --sndbuf-size=512k --timeout=60 --connect-int=10 --ping-int=10 --ko-count=4 --on-disconnect=reconnect
> 22101 pts/0    S      0:00 /sbin/drbdsetup /dev/drbd0 cstate
> 22426 pts/0    S      0:00 /sbin/drbdsetup /dev/drbd0 state
> 25502 pts/0    S      0:00 /sbin/drbdsetup /dev/drbd0 net 10.0.0.1:7788 10.0.0.2:7788 C --sndbuf-size=512k --timeout=60 --connect-int=10 --ping-int=10 --ko-count=4 --on-disconnect=reconnect
> 25665 pts/0    S+     0:00 grep drbd
>
>
> it seems like process 21789 is the culprit, do you think I can kill it 
> without risking corruption/data loss?
>

-- 
**************************************************
Jonathan Soong
Institute of Medical and Veterinary Science
Information, Communication and Technology Services
www.imvs.org Ph: +61 8 8222 3095



