[DRBD-user] Re: drbd 0.7.13 slow resync and panic with RedHat kernel 2.4.21-32.0.1.ELsmp

Diego Liziero diegoliz at carpidiem.it
Thu Sep 15 09:18:19 CEST 2005



Lars Ellenberg wrote:
> Diego Liziero:
>> [..]
>> While the 6th and last drbd partition was syncing, we first noticed a
>> slowdown.
>>
>> The bitrate went down from 480Mbit/sec to about 60Mbit/sec.
>>
>> The link between the 2 nodes of the cluster is a dedicated gigabit
>> ethernet link used only by drbd, we noticed and measured
>> this slowdown using iptraf.
>
>note that to the best of my knowledge iptraf rate measurement is buggy.
>we recently tried to measure performance of iSCSI initiators/targets,
>and nearly went up the wall when we recognized after hours of fruitless
>tuning that the measurement was broken....

I forgot to mention that, during the slowdown,
cat /proc/drbd reported roughly the same rate,
and the estimated time to finish jumped from about 45 minutes
to more than 8 hours.
(We were watching it with a while loop: cat /proc/drbd; sleep 5 )
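For reference, the loop was nothing more than the sketch below; the speed-extraction helper is an illustrative addition of mine, not something we actually ran:

```shell
#!/bin/sh
# The monitoring loop was essentially:
#   while true; do cat /proc/drbd; sleep 5; done
#
# Illustrative helper (my addition, not part of the original setup):
# pull the average sync speed (K/sec) out of a /proc/drbd
# "finish: ... speed: ..." line read on stdin.
drbd_speed() {
    sed -n 's/.*speed: *\([0-9,]*\).*/\1/p' | tr -d ,
}
```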

>> The last partition is the biggest one (250G), and after 10% of the
>> resync process, the primary node hung. The console was black,
>> the keyboard was not responding, and we had to press the reset button.
>
>this however is interesting.
>does this device sync successfully
> - if it is the only configured device?
> - if you configure fewer devices?
> - if you reorder it, i.e. it comes first, not last?
> - if you reorder sync groups, i.e. it is not synced last?

We didn't try changing the configuration.

We can do some more tests next Wednesday afternoon,
but we have only a few minutes available during which all
the cluster services may be stopped.

Yesterday, before the sync, all partitions were forced primary
and we started all services on the last primary
node (the cluster has been in production
since August 2004 and never stopped with drbd 0.6.12).

After the first resync and lockup, all the other devices got
resynced very quickly thanks to the metadisk data, but the last one
always locked up the system. Whenever we checked the speed,
we noticed the slowdown before the lockup.

Here is what is left on my remote xterminal from yesterday:
version: 0.7.13 (api:77/proto:74)
SVN Revision: 1942 build by root at .. , 2005-09-12 12:44:26
 0: cs:Connected st:Primary/Secondary ld:Consistent
    ns:50789466 nr:0 dw:40 dr:50789583 al:3 bm:9300 lo:0 pe:0 ua:0 ap:0
 1: cs:Connected st:Primary/Secondary ld:Consistent
    ns:8555234 nr:0 dw:170636 dr:8451935 al:207 bm:1536 lo:0 pe:0 ua:0 ap:0
 2: cs:Connected st:Primary/Secondary ld:Consistent
    ns:8615170 nr:0 dw:231256 dr:8434619 al:43 bm:1536 lo:0 pe:0 ua:0 ap:0
 3: cs:Connected st:Primary/Secondary ld:Consistent
    ns:8397222 nr:0 dw:12184 dr:8389563 al:56 bm:1536 lo:0 pe:0 ua:0 ap:0
 4: cs:Connected st:Primary/Secondary ld:Consistent
    ns:8385898 nr:0 dw:145 dr:8534575 al:2 bm:1536 lo:0 pe:0 ua:0 ap:0
 5: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:4716200 nr:0 dw:60688 dr:5035525 al:250 bm:32608 lo:0 pe:53 ua:2019 ap:0
        [>...................] sync'ed:  1.8% (254025/258577)M
        finish: 7:22:23 speed: 9,756 (8,336) K/sec
 6: cs:Unconfigured
 7: cs:Unconfigured
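As a rough sanity check on that finish estimate (illustrative arithmetic only, not DRBD's exact moving average): 254025 MB still to sync at an average of 9,756 K/sec works out to roughly 7.4 hours, consistent with the 7:22:23 shown above.

```python
# Back-of-the-envelope check of the "finish:" estimate in the /proc/drbd
# output above. Illustrative only: DRBD computes its own moving average.
remaining_mb = 254025          # MB left to sync, from the sync progress line
speed_kib_s = 9756             # average speed in K/sec, from the same line

seconds = remaining_mb * 1024 / speed_kib_s
hours, rem = divmod(int(seconds), 3600)
minutes = rem // 60
print(f"estimated finish: {hours}:{minutes:02d}")  # roughly 7:24
```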

The conf file section for the 5th partition is:

resource Mail {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; # halt -f";
  startup { wfc-timeout 0; degr-wfc-timeout 120; }
  disk    { on-io-error detach; }
  net     { timeout 60; connect-int 10; ping-int 10;
            max-buffers 2048; max-epoch-size 2048;
            ko-count 4; on-disconnect reconnect; }
  syncer  {
    rate 500M;
    group 6;
  }
  on oberon {
    device /dev/drbd5;
    disk /dev/cciss/c0d0p12;
    address 10.2.1.2:7793;
    meta-disk /dev/cciss/c0d0p14 [5];
  }
  on giano {
    device /dev/drbd5;
    disk /dev/cciss/c0d0p12;
    address 10.2.1.1:7793;
    meta-disk /dev/cciss/c0d0p14 [5];
  }
}
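Regarding the last two suggestions above: to the best of my knowledge, in drbd 0.7 the syncer groups are resynchronized serially in ascending order, so making this resource sync first would only need a change in its syncer section, e.g. (a sketch; group 0 is assumed unused by the other resources):

```
  syncer  {
    rate 500M;
    group 0;    # lower groups resync first; 0 assumed free here
  }
```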

>> We tried this process several times, with different versions
>> of the 2.4.21smp kernel, each time with a freshly recompiled
>> 0.7.13 drbd module.
>>
>> In all cases we got a system hang during the resync, sometimes
>> with a slowdown of the sync rate some minutes before the hang.
>>
>> In one case we were able to see an Oops message on the console,
>> but unfortunately just the last lines were visible
>> (I remember something about tasker, irq and smp)
>> and shift-pageup was not working.
>
>try to grab that with a serial console.
>
>==> make sure you have NMI watchdog enabled in your kernel <==
>to better detect deadlocks.

It was not. I have just added nmi_watchdog=1 to the kernel line in grub.conf.
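For the record, the resulting grub.conf entry looks something like this (a sketch; the root= value and image names are placeholders, not our actual ones):

```
title Red Hat Enterprise Linux (2.4.21-32.0.1.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-32.0.1.ELsmp ro root=LABEL=/ nmi_watchdog=1
        initrd /initrd-2.4.21-32.0.1.ELsmp.img
```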

Regards,
Diego.




