[DRBD-user] Re: drbd 0.7.13 slow resync and panic with RedHat kernel 2.4.21-32.0.1.ELsmp

Lars Ellenberg Lars.Ellenberg at linbit.com
Thu Sep 15 11:39:57 CEST 2005



/ 2005-09-15 09:18:19 +0200
\ Diego Liziero:
> 
> I forgot to mention that, during the slowdown,
> with cat /proc/drbd we had almost the same results,
> and the time to finish switched from about 45 minutes to more than 8
> hours.
> (we had a while loop with cat /proc/drbd; sleep 5)

watch -n5 cat /proc/drbd  # does it neatly...
ok.
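
To catch the slowdown after the fact, the speed can also be logged with a timestamp instead of only watched. A minimal sketch, assuming the 0.7 "finish: ... speed: ..." line format quoted further down in this mail; the function name and the log path are made up for illustration:

```shell
# Print the current resync speed in K/sec from /proc/drbd-style text on stdin.
# Assumes a line like: "finish: 7:22:23 speed: 9,756 (8,336) K/sec".
# Prints nothing if no such line is present (no resync running).
drbd_sync_speed() {
    sed -n 's/.*speed: \([0-9,]*\) (.*/\1/p' | tr -d ','
}

# Hypothetical logging loop, one timestamped sample every 5 seconds:
# while sleep 5; do
#     printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(drbd_sync_speed < /proc/drbd)"
# done >> /var/tmp/drbd-speed.log
```

With several devices syncing at once this would print one speed per syncing device, so the log line format would need adjusting.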

> >> The last partition is the biggest one (250G), and after 10% of the
> >> resync process, the primary cluster hung. The console was black,
> >> the keyboard not responding, we had to press the reset button.
> >
> >this however is interesting.
> >does this device sync successfully
> > - if it is the only configured device?
> > - if you configure fewer devices?
> > - if you reorder it, i.e. it comes first instead of last?
> > - if you reorder sync groups, i.e. it is not synced last?
> 
> We didn't try to change the configuration.
> 
> We can do some more tests next wednesday afternoon,
> but we have few minutes available where all
> the cluster services may be stopped.

so what is the current situation?  back to 0.6?

> Yesterday, before the sync, all partitions were forced primary
> and we started all services in the last primary
> node  (the cluster is in production
> since august 2004 and never stopped with drbd 0.6.12).
> 
> After the first resync and lockup, all the other devices got
> synced very quickly with the metadisk data, and the last one
> always locked the system. And when we checked the speed
> we noticed the slowdown before the lock.

good. well, in a way.  since it is that reproducible,
we should be able to track down the cause.

> Here is what is left on my remote xterminal from yesterday:
> version: 0.7.13 (api:77/proto:74)
> SVN Revision: 1942 build by root at .. , 2005-09-12 12:44:26
>  0: cs:Connected st:Primary/Secondary ld:Consistent
>     ns:50789466 nr:0 dw:40 dr:50789583 al:3 bm:9300 lo:0 pe:0 ua:0 ap:0
>  1: cs:Connected st:Primary/Secondary ld:Consistent
>     ns:8555234 nr:0 dw:170636 dr:8451935 al:207 bm:1536 lo:0 pe:0 ua:0 ap:0
>  2: cs:Connected st:Primary/Secondary ld:Consistent
>     ns:8615170 nr:0 dw:231256 dr:8434619 al:43 bm:1536 lo:0 pe:0 ua:0 ap:0
>  3: cs:Connected st:Primary/Secondary ld:Consistent
>     ns:8397222 nr:0 dw:12184 dr:8389563 al:56 bm:1536 lo:0 pe:0 ua:0 ap:0
>  4: cs:Connected st:Primary/Secondary ld:Consistent
>     ns:8385898 nr:0 dw:145 dr:8534575 al:2 bm:1536 lo:0 pe:0 ua:0 ap:0
>  5: cs:SyncSource st:Primary/Secondary ld:Consistent
>     ns:4716200 nr:0 dw:60688 dr:5035525 al:250 bm:32608 lo:0 pe:53 ua:2019 ap:0
>         [>...................] sync'ed:  1.8% (254025/258577)M
>         finish: 7:22:23 speed: 9,756 (8,336) K/sec

is it already mounted and serving?

> The conf file part of the 5th partition is:
> 
> resource Mail {
>   protocol C;
>   syncer  {
>     rate 500M;

maybe better set this to something like what is actually physically possible.
say, if you have seen a maximum sync transfer rate of 27M, set it to 30M.

but that should not have anything to do with the problem at hand.
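
As a rough plausibility check on the numbers quoted above: the status output shows 254025 M still to sync at 9,756 K/sec. The sketch below estimates the remaining time at that speed and at the 27M example rate. It assumes 1 M = 1024 K and ignores application I/O; the function name is made up for illustration.

```shell
# eta_seconds REMAINING_MB RATE_KB_PER_SEC
# Integer estimate of the remaining resync time, assuming 1 M = 1024 K.
eta_seconds() {
    echo $(( $1 * 1024 / $2 ))
}

eta_seconds 254025 9756     # at the observed speed: 26662 s, about 7.4 hours
eta_seconds 254025 27648    # at 27M (27 * 1024 K/sec): 9408 s, about 2.6 hours
```

The first figure is in line with the "finish: 7:22:23" that drbd itself reported.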

>     group 6;

ok, so I guess all of them are in different sync groups, right?

>   }
>     disk /dev/cciss/c0d0p12;
>     meta-disk /dev/cciss/c0d0p14 [5];

looks fine.

can you somehow verify that the 
     meta-disk /dev/cciss/c0d0p14 [5];
has no physical problems?

whenever a certain amount of bitmap is cleared,
the corresponding bitmap sector is written to meta-disk,
synchronously. if that write "never completes" for some reason
(the lower level driver retrying "forever"),
drbd and possibly the system may appear to be hung.
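
One non-destructive first check is to read the whole meta-disk partition once and then look at the kernel log. This is only a sketch: it is read-only, so it can catch read errors and driver complaints, but it does not prove that writes complete. The function name is made up; the device name is the one from the config quoted above.

```shell
# check_readable DEVICE
# Read the whole device (or file) once and report whether the read completed.
# Read-only: nothing is written to DEVICE.
check_readable() {
    dev=$1
    if dd if="$dev" of=/dev/null bs=1024k 2>/dev/null; then
        echo "read ok: $dev"
    else
        echo "read FAILED: $dev"
        return 1
    fi
}

# Example use on the node in question:
# check_readable /dev/cciss/c0d0p14
# dmesg | grep -i cciss    # look for errors or retries logged by the driver
```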

> >> We tried this process various times, and with different versions
> >> of the 2.4.21smp kernel and all with a new (recompiled
> >> each time) 0.7.13 drbd module.
> >>
> >> In all cases we got a system hang during the resync, sometimes
> >> with a slowdown of the sync rate some minutes before the hang.
> >>
> >> In one case we were able to see an Oops message on the console,
> >> but unfortunately just the last lines were visible
> >> (I remember something about tasker, irq and smp)
> >> and shift-pageup was not working.
> >
> >try to grab that with a serial console.
> >
> >==> make sure you have NMI watchdog enabled in your kernel <==
> >to better detect deadlocks.
> 
> It was not. Just added nmi_watchdog=1 in grub.conf kernel line.

thanks.
I'm confident that we are able to find the cause,
since you seem to be able to reproduce it.
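
After the next reboot it is worth confirming that the nmi_watchdog option actually took effect. A small sketch, with a made-up function name, that checks a kernel command line for a non-zero nmi_watchdog= option; the NMI count in /proc/interrupts should then keep increasing while the watchdog is ticking:

```shell
# nmi_watchdog_requested CMDLINE
# Succeeds if CMDLINE contains nmi_watchdog=<something other than 0>.
nmi_watchdog_requested() {
    case " $1 " in
        *" nmi_watchdog=0 "*) return 1 ;;
        *" nmi_watchdog="*)   return 0 ;;
        *)                    return 1 ;;
    esac
}

# Example use on the running system:
# nmi_watchdog_requested "$(cat /proc/cmdline)" && echo "nmi_watchdog is on the boot line"
# grep NMI /proc/interrupts    # run twice a few seconds apart and compare the counts
```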

would be better if you could reproduce it on a non-production cluster,
of course.

cheers,

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


