[DRBD-user] problem with drbd reconnection

Wed Jan 11 11:15:43 CET 2006

Christoph Mitasch wrote:
> Hi Nick!
> 
> Have you tried
> drbdadm connect all
> on the Primary?
> 
> Christoph
> 
> On Wed, 2006-01-11 at 10:47 +0100, nick wrote:
> 
>>I have two identical systems running drbd 0.7.10 (api:77) on CentOS 4 
>>with kernel 2.6.9-5.0.3. They share one resource, hadisk, which is a 
>>software RAID partition used for mail storage.
>>
>>Last night, something went funky on the secondary and it had to be 
>>restarted (it completely ran out of memory, and it's not the first time 
>>that has happened). When it came up, instead of connecting like it 
>>normally does, it just sat there waiting for a connection.
>>
>>a cat /proc/drbd on the primary gives me this:
>>
>>  0: cs:WFReportParams st:Primary/Unknown ld:Consistent
>>     ns:75710352 nr:0 dw:80996564 dr:25603549 al:156060 bm:15221 lo:0 pe:0 ua:0 ap:0
>>
>>on the secondary, I get this:
>>
>>0: cs:WFConnection st:Secondary/Unknown ld:Consistent
>>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
>>
>>
>>
>>As it says, it is waiting for a connection, which is the same state I 
>>get if I run drbdadm cstate all.
>>
>>However, if I run drbdadm (with any subcommand) on the primary, it gives 
>>me this message:
>>
>>Child process does not terminate!
>>Exiting.
>>
>>Which is not good.
>>
>>The problem is, the primary is still up, and accepting data, so I really 
>>don't want to do anything rash, as I can't afford to lose mail (the 
>>secondary is out of date, by how much I'm not sure). What options do I have?
>>
>>I'll gladly give you any information you need. Please help me get these 
>>two back and talking, with as little data loss as possible.
>>
>>
>>Nick
>>
> 
> 
Yes, I have. It gives the same message:

Child process does not terminate!
Exiting.

This is what ps shows:

  ps ax | grep drbd
21789 pts/0    D      0:00 /sbin/drbdsetup /dev/drbd0 net 10.0.0.1:7788 10.0.0.2:7788 C --sndbuf-size=512k --timeout=60 --connect-int=10 --ping-int=10 --ko-count=4 --on-disconnect=reconnect
22101 pts/0    S      0:00 /sbin/drbdsetup /dev/drbd0 cstate
22426 pts/0    S      0:00 /sbin/drbdsetup /dev/drbd0 state
25502 pts/0    S      0:00 /sbin/drbdsetup /dev/drbd0 net 10.0.0.1:7788 10.0.0.2:7788 C --sndbuf-size=512k --timeout=60 --connect-int=10 --ping-int=10 --ko-count=4 --on-disconnect=reconnect
25665 pts/0    S+     0:00 grep drbd

It seems like process 21789, the one in state D, is the culprit. Do you 
think I can kill it without risking corruption or data loss?
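
In case it helps with the diagnosis, here is a minimal sketch for picking out drbdsetup processes that sit in state D (uninterruptible sleep) from a ps listing. The function name stuck_drbdsetup is made up for illustration, and the field layout it expects (PID, STAT, WCHAN, then the command line) is an assumption that matches procps "ps -eo pid,stat,wchan:32,args", not the "ps ax" output above.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs of drbdsetup processes whose state
# starts with D (uninterruptible sleep).
# Assumed input format, one process per line: PID STAT WCHAN COMMAND [ARGS...]
stuck_drbdsetup() {
    # $2 is the STAT column, $4 is the executable path of the command.
    awk '$2 ~ /^D/ && $4 ~ /(^|\/)drbdsetup$/ { print $1 }'
}

# Usage on a live system (assumes procps ps):
#   ps -eo pid,stat,wchan:32,args | stuck_drbdsetup
```

The WCHAN column is included because it names the kernel function the process is sleeping in, which may be useful information to post to the list along with the PID.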