[DRBD-user] Re: drbd 0.7.13 slow resync and panic with RedHat kernel 2.4.21-32.0.1.ELsmp

Wed Sep 14 20:31:53 CEST 2005

/ 2005-09-14 18:49:03 +0200
\ Diego Liziero:
> Hello,
> today we tried to update our drbd 0.6.x system to 0.7.13
> using a free disk partition as meta-disk.
> 
> We followed all the update instructions and we got the first
> 5 drbd partitions in sync with the new 0.7.13 version.
> 
> While the 6th and last drbd partition was syncing, we first noticed a
> slowdown.
> 
> The bitrate went down from 480Mbit/sec to about 60Mbit/sec.
> 
> The link between the 2 nodes of the cluster is a dedicated gigabit
> ethernet link used only by drbd, we noticed and measured
> this slowdown using iptraf.

note that, to the best of my knowledge, iptraf's rate measurement is buggy.
we recently tried to measure the performance of iSCSI initiators/targets,
and nearly went up the wall when we realized, after hours of fruitless
tuning, that the measurement itself was broken...
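
a rough way to cross-check the rate without iptraf is to read the kernel's
own byte counters from /proc/net/dev twice and divide by the interval. the
sketch below is only an illustration: the function names and the interface
name "eth1" are made up, and it assumes the counters do not wrap between the
two samples (keep the interval short on a fast link).

```python
import time


def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        # split on the colon: older kernels may print "eth0:1234" without a space
        name, data = line.split(":", 1)
        fields = data.split()
        # field 0 is received bytes, field 8 is transmitted bytes
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def mbit_per_sec(bytes_before, bytes_after, seconds):
    """Convert a byte-counter difference into Mbit/s."""
    return (bytes_after - bytes_before) * 8 / seconds / 1e6


def sample(iface, interval=5.0, path="/proc/net/dev"):
    """Return (rx, tx) in Mbit/s for iface, measured over interval seconds."""
    with open(path) as f:
        before = parse_net_dev(f.read())[iface]
    time.sleep(interval)
    with open(path) as f:
        after = parse_net_dev(f.read())[iface]
    return (mbit_per_sec(before[0], after[0], interval),
            mbit_per_sec(before[1], after[1], interval))
```

calling sample("eth1") on the sync source during a resync, with the name of
the dedicated link substituted, gives a figure to compare against iptraf.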

> The last partition is the biggest one (250G), and after 10% of the
> resync process, the primary node hung. The console was black,
> the keyboard not responding, we had to press the reset button.

this, however, is interesting.
does this device sync successfully
 - if it is the only configured device?
 - if you configure fewer devices?
 - if you reorder it, i.e. it comes first instead of last?
 - if you reorder sync groups, i.e. it is not synced last?

> We tried this process several times, with different versions
> of the 2.4.21smp kernel, each time with a freshly recompiled
> 0.7.13 drbd module.
> 
> In all cases we got a system hang during the resync, sometimes
> with a slowdown of the sync rate some minutes before the hang.
> 
> In one case we were able to see an Oops message on the console,
> but unfortunately just the last lines were visible
> (I remember something about tasker, irq and smp)
> and shift-pageup was not working.

try to grab that with a serial console.

==> make sure you have NMI watchdog enabled in your kernel <==
to better detect deadlocks.
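
to check whether the watchdog is actually running, one option is to look for
an nmi_watchdog= parameter in /proc/cmdline and to see whether the NMI line in
/proc/interrupts keeps counting. the helpers below are a hypothetical sketch,
not part of any tool; they parse text you have already read from those two
files, and the "is it ticking" test is only a heuristic.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts content, or []."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() == "NMI":
            counts = []
            for tok in rest.split():
                if not tok.isdigit():
                    break  # some kernels append a text description
                counts.append(int(tok))
            return counts
    return []


def nmi_watchdog_param(cmdline_text):
    """Return the value of nmi_watchdog= from /proc/cmdline content, or None."""
    for tok in cmdline_text.split():
        if tok.startswith("nmi_watchdog="):
            return tok.split("=", 1)[1]
    return None


def watchdog_ticking(before, after):
    """Heuristic: True if at least one CPU's NMI count grew between samples."""
    if not before or len(before) != len(after):
        return False
    return any(b < a for b, a in zip(before, after))
```

take two nmi_counts() samples a few seconds apart; if nothing increases, the
watchdog is probably not active and a deadlock would go unreported.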

> Our system is a cluster of 2 servers, each with 4 Xeon
> processors and 7 GB of RAM.
> 
> The same kernel version works fine with drbd 0.6.12

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :