[DRBD-user] Oopses/timeouts on 0.7.5 with many big volumes

Andreas Huck drbd at huck.it
Tue Nov 30 15:00:34 CET 2004


Hi,

On Tuesday 30 November 2004 13:34, Philipp Reisner wrote:
> > > > If I start again:
> > > >
> > > > Starting DRBD resources:    [vol21][vol22]Child process does not
> > > > terminate! Exiting.
> > > >
> > > > and so on : one more volume is completed every time I launch the
> > > > script. Looking at the source code, I believe the timeout has to do
> > > > with the SLEEPS_LONG / SLEEPS_VERY_LONG values that probably aren't
> > > > enough in my case (but I'm not sure if I'm hitting the "long" or "very
> > > > long" timeout). Can these be made into module parameters, or at least
> > > > #define's ?
> > > >
> > > > Eventually, after enough launches, all volumes are correctly
> > > > initialized, except...
> > >
> > > Yes, I have changed SLEEPS_LONG from 60 to 120 Seconds with the
> > > drbd-0.7.6 release.
> >
> > This might not really solve the issue, since the "child process" (the one
> > which does not terminate) stays until the corresponding device is in sync.
> > Couldn't drbd check whether the child is syncing and go on with the other
> > devices? In fact, drbd might first check which devices are in the highest
> > sync group (e.g. group=1) and start the corresponding child processes for
> > parallel sync. The others might be scheduled for sync accordingly. Just my
> > thoughts. Is there any need to wait? I consider "Exiting" a bug in an HA
> > context. I stumbled on the same problem recently.
> >
> 
> Renaud and I were referring to "drbdsetup /dev/drbdX disk ..." while you
> are probably referring to "drbdsetup /dev/drbdX wait_sync"...

I follow the same procedure Renaud described: /etc/init.d/drbd start, which
essentially comes down to:
A) # /sbin/drbdadm wait_connect all
or (another, similar failure)
B) # /sbin/drbdadm up drbdX

After a STONITH and restart of fsrv1 (fsrv2 is active in the meantime), drbd
does not connect all devices as desired. A higher timeout might help, but if
the AL (activity log) is not sparse, resyncing may take several more minutes
on bigger volumes.
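
To put rough numbers on the timeout concern, here is a minimal sketch in
Python. It takes the "remaining K" and current "K/sec" figures from the
/proc/drbd samples in this message and compares the resulting estimate with
the 60 and 120 second SLEEPS_LONG values quoted above. The function names are
made up for illustration, and whether that timeout really spans the resync is
exactly the open question in this thread, so treat it as arithmetic only.

```python
# Illustrative arithmetic only; not taken from the drbd sources.

def resync_seconds(remaining_k: int, speed_k_per_s: int) -> int:
    """Whole seconds needed to move remaining_k at a constant rate."""
    if speed_k_per_s <= 0:
        raise ValueError("speed must be positive")
    return remaining_k // speed_k_per_s


def as_hms(seconds: int) -> str:
    """Format seconds as h:mm:ss, the layout of the 'finish:' field."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# (remaining K, current K/sec) copied from the two /proc/drbd samples.
SAMPLES = {
    "case A, device 0": (171652, 5538),
    "case B, device 0": (730060, 5474),
}

# SLEEPS_LONG before and after the 0.7.6 change mentioned in the quote.
TIMEOUTS = (60, 120)

if __name__ == "__main__":
    for label, (remaining, speed) in SAMPLES.items():
        needed = resync_seconds(remaining, speed)
        verdicts = ", ".join(
            f"{t}s {'enough' if needed <= t else 'too short'}" for t in TIMEOUTS
        )
        print(f"{label}: about {as_hms(needed)} left ({verdicts})")
```

For case B this gives 133 seconds, which agrees with the "finish: 0:02:13"
shown in /proc/drbd and is already beyond 120 seconds for a volume of only
about 1 GB.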

A) 

fsrv1:~ # rcdrbd start
Starting DRBD resources:    [drbd0][drbd1][drbd2]
Waiting until resources are connected (or timeouted)ioctl(,WAIT_*,) failed: Timer expired
drbdsetup exited with code 20
.

While waiting:

fsrv2:~ # cat /proc/drbd
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:0 nr:0 dw:460 dr:887846 al:2 bm:76 lo:1440 pe:0 ua:1440 ap:0
        [================>...] sync'ed: 83.7% (171652/1052676)K
        finish: 0:00:30 speed: 5,538 (5,275) K/sec
 1: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:80 dw:140 dr:58362 al:2 bm:2 lo:0 pe:0 ua:0 ap:0
 2: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:60 dw:100 dr:58020 al:1 bm:1 lo:0 pe:0 ua:0 ap:0
fsrv2:/dev/vgs80a # cat /proc/drbd
version: 0.7-pre8 (api:74/proto:72)

And about two minutes later, after the timeout expired:

fsrv2:~ # cat /proc/drbd
 0: cs:Connected st:Primary/Secondary ld:Consistent
    ns:0 nr:0 dw:460 dr:1053738 al:2 bm:94 lo:0 pe:0 ua:0 ap:0
 1: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:80 dw:140 dr:58362 al:2 bm:2 lo:0 pe:0 ua:0 ap:0
 2: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:60 dw:100 dr:58020 al:1 bm:1 lo:0 pe:0 ua:0 ap:0

So the system comes up, but does not recover fully.
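
A check like the one I did by eye above could be scripted. The following is
only a sketch in Python: it parses the per-device status lines in the 0.7
/proc/drbd layout shown in this message and lists the devices that are left
without a peer. The regular expression, the function names and the choice of
"unlinked" states (just the two that appear in this message) are my own
assumptions, not anything shipped with drbd.

```python
import re

# Matches lines such as " 1: cs:StandAlone st:Primary/Unknown ld:Consistent".
STATUS_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+st:(\S+)\s+ld:(\S+)")

# Assumption: only the peerless states seen in this message are listed here.
UNLINKED_STATES = {"StandAlone", "WFConnection"}


def parse_proc_drbd(text: str) -> dict:
    """Return {minor: {"cs": ..., "st": ..., "ld": ...}} for each device."""
    devices = {}
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            minor, cs, st, ld = match.groups()
            devices[int(minor)] = {"cs": cs, "st": st, "ld": ld}
    return devices


def unlinked(devices: dict) -> list:
    """Minor numbers of devices whose connection state has no peer."""
    return sorted(m for m, d in devices.items() if d["cs"] in UNLINKED_STATES)


# Sample copied from the second /proc/drbd listing of case A.
SAMPLE = """\
 0: cs:Connected st:Primary/Secondary ld:Consistent
    ns:0 nr:0 dw:460 dr:1053738 al:2 bm:94 lo:0 pe:0 ua:0 ap:0
 1: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:80 dw:140 dr:58362 al:2 bm:2 lo:0 pe:0 ua:0 ap:0
 2: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:60 dw:100 dr:58020 al:1 bm:1 lo:0 pe:0 ua:0 ap:0
"""

if __name__ == "__main__":
    for minor in unlinked(parse_proc_drbd(SAMPLE)):
        print(f"drbd{minor} is not connected to its peer")
```

On a live node the text would come from reading /proc/drbd instead of the
embedded sample.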

And B)

fsrv1:~ # /etc/init.d/drbd start
Starting DRBD resources:    [drbd0]Child process does not terminate!
Exiting.

 Failed setting up drbd1
--

fsrv2:~ # cat /proc/drbd
version: 0.7-pre8 (api:74/proto:72)

 0: cs:SyncTarget st:Secondary/Secondary ld:Inconsistent
    ns:0 nr:324580 dw:1394488 dr:2111612 al:4 bm:277 lo:0 pe:1267 ua:0 ap:0
        [======>.............] sync'ed: 30.7% (730060/1052672)K
        finish: 0:02:13 speed: 5,474 (5,376) K/sec
 1: cs:WFConnection st:Primary/Unknown ld:Consistent
    ns:0 nr:3848 dw:1056588 dr:2111268 al:1 bm:395 lo:0 pe:0 ua:0 ap:0
 2: cs:WFConnection st:Primary/Unknown ld:Consistent
    ns:0 nr:3724 dw:1056400 dr:2111072 al:1 bm:396 lo:0 pe:0 ua:0 ap:0


Regards,
Andreas

> 
> -Phil


