[DRBD-user] Unmounting filesystems after some time freezes the entire system

Thu Jan 11 15:20:41 CET 2007

Hi,

I have a problem with a DRBD/HeartBeat setup on a RedHat AS 4 cluster.
The cluster is fully functional for some time. If I then stop the
services using resources on a /dev/drbdX file system and simply try to
umount that file system, the system hangs hard, although it does not
seem to kernel panic. The period during which I can still unmount
without problems seems to be a few days (I am currently trying to
determine it).

The cursor still blinks on the screen, but I lose all services on the
network: logging in by SSH is no longer possible, yet the system still
answers ping, which is odd. I have to power-cycle the server by hand
(i.e. unplug it) to bring it back up. I cannot get any information on
the exact cause of the crash, since I lose almost every interface to the
server when it happens. The keyboard is unresponsive (Ctrl-Alt-Del does
nothing), but the Num Lock key still seems to work.
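
For anyone who wants to check for the same state from a second machine,
here is a minimal Python sketch. It is not something used in this setup:
the function name ssh_banner and the timeout values are assumptions made
for illustration. It waits for the SSH identification line instead of
only opening a TCP connection, because the kernel can still complete the
handshake on a listening socket while the sshd process itself is stuck.

```python
import socket
from typing import Optional


def ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> Optional[str]:
    """Return the SSH identification line sent by the server, or None.

    None means the connection failed, timed out, or nothing that looks
    like an SSH banner arrived within the timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
    except OSError:  # covers refused connections and timeouts
        return None
    line = data.split(b"\n", 1)[0].strip()
    if not line.startswith(b"SSH-"):
        return None
    return line.decode("ascii", "replace")
```

A host that answers ping but returns None here would match the symptom
described above: the network stack still responds, but the userspace
services do not.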

I am not sure whether this comes from DRBD directly, but if I unmount
another file system (local, not located on a DRBD device), I don't get
the problem.

I tried remounting the file system read-only before unmounting it. The
remount itself succeeded without any problem, but I got the same crash
at unmount. So it does not seem to come from write access to the device.
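
The sequence tried above can be sketched as follows. This is only an
illustration: the helper unmount_steps and the mount point /mnt/res1 are
hypothetical (the real mount points are not given in this mail), and the
sketch only builds and prints the commands; it does not run them. The
services using the file system are assumed to be stopped already.

```python
import shlex
from typing import List


def unmount_steps(mount_point: str, remount_ro_first: bool = True) -> List[List[str]]:
    """Build the argv lists for the unmount sequence, without running them."""
    steps: List[List[str]] = []
    if remount_ro_first:
        # Remount read-only first, to rule out pending writes as the cause.
        steps.append(["mount", "-o", "remount,ro", mount_point])
    # The plain umount is the step at which the freeze was observed.
    steps.append(["umount", mount_point])
    return steps


if __name__ == "__main__":
    # "/mnt/res1" is a hypothetical mount point used for illustration only.
    for argv in unmount_steps("/mnt/res1"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Each returned list could be passed to subprocess.run as root; printing
the commands first keeps the sketch safe to try.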

Versions on both nodes of my cluster :

- RedHat AS 4 (up-to-date)
- Linux kernel : kernel-hugemem-2.6.9-42.0.2.EL (the latest RedHat
version at the time)
- Linux kernel header files : kernel-hugemem-devel-2.6.9-42.0.2.EL
- DRBD software : 0.7.22 (compiled by hand)
- FileSystem type : EXT3 (mount options : defaults,rw)

The most recent RedHat kernel revision (kernel-hugemem-2.6.9-42.0.3.EL)
has since been deployed, and DRBD 0.7.22 has been recompiled for this
kernel and installed. I cannot yet tell whether this makes the error go
away.

One more piece of information about my configuration: it is an
active/active cluster (2 DRBD resources), and the crash is symmetrical
(it happens on both resources/servers). The DRBD configuration is
attached to this mail.

So it seems to happen only after some time, and only on the DRBD
devices.

Googling for the problem has not turned up any useful clues so far.

Has anyone already seen something like this, and does anyone have a
suggestion for debugging it further or for solving it?

Thanks in advance.

Charles-Antoine Guillat-Guignard

-------------- next part --------------
global {
    minor-count 2;
    dialog-refresh 1;

    # You might disable one of drbdadm's sanity checks.
    # disable-ip-verification;
}

resource res1 {

  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    wfc-timeout  90;
    degr-wfc-timeout 180;    # 3 minutes.
  }

  disk {
    on-io-error   detach;
  }

  net {
    sndbuf-size 512k;
    timeout       90;    #  9 seconds  (unit = 0.1 seconds)
    connect-int   20;    # 20 seconds  (unit = 1 second)
    ping-int      10;    # 10 seconds  (unit = 1 second)
    max-buffers     4092;
    max-epoch-size  2048;
    ko-count 0;
    on-disconnect reconnect;
  }

  syncer {
    rate 125M;
    group 1;
    al-extents 257;
  }

  on prod0n {
    device     /dev/drbd0;
    disk       /dev/sdb10;
    address    172.17.69.1:7788;
    meta-disk  /dev/sdb9[0];
  }

  on prod1n {
    device    /dev/drbd0;
    disk      /dev/sda2;
    address   172.17.69.2:7788;
    meta-disk /dev/sda1[0];
  }
}

resource res2 {

  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    wfc-timeout  90;
    degr-wfc-timeout 180;    # 3 minutes.
  }

  disk {
    on-io-error   detach;
  }

  net {
    sndbuf-size 512k;
    timeout       90;    #  9 seconds  (unit = 0.1 seconds)
    connect-int   20;    # 20 seconds  (unit = 1 second)
    ping-int      10;    # 10 seconds  (unit = 1 second)
    max-buffers     4092;
    max-epoch-size  2048;
    ko-count 0;
    on-disconnect reconnect;
  }

  syncer {
    rate 125M;
    group 2;
    al-extents 257;
  }

  on prod0n {
    device     /dev/drbd1;
    disk       /dev/sda2;
    address    172.17.96.1:7789;
    meta-disk  /dev/sda1[0];
  }

  on prod1n {
    device    /dev/drbd1;
    disk      /dev/sdb10;
    address   172.17.96.2:7789;
    meta-disk /dev/sdb9[0];
  }
}
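
As a cross-check of the timing values in the configuration above, the
following Python sketch converts them to seconds. The units for the net
options are the ones stated in the configuration comments; treating
wfc-timeout and degr-wfc-timeout as seconds is an assumption based on
the DRBD 0.7 documentation. The names UNITS_PER_SECOND and to_seconds
are made up for this illustration.

```python
# Number of configuration units per second for each option.
# timeout is given in tenths of a second; the others are in seconds.
UNITS_PER_SECOND = {
    "timeout": 10,
    "connect-int": 1,
    "ping-int": 1,
    "wfc-timeout": 1,        # assumed to be seconds
    "degr-wfc-timeout": 1,   # assumed to be seconds
}


def to_seconds(option: str, value: int) -> float:
    """Convert a configured value to seconds using the option's unit."""
    return value / UNITS_PER_SECOND[option]


if __name__ == "__main__":
    # Values taken from the net and startup sections of both resources.
    configured = {
        "timeout": 90,
        "connect-int": 20,
        "ping-int": 10,
        "wfc-timeout": 90,
        "degr-wfc-timeout": 180,
    }
    for option, value in configured.items():
        print(f"{option:18s} {value:4d} -> {to_seconds(option, value):6.1f} s")
```

Worked through their units, the configured values come to 9 seconds for
timeout, 20 seconds for connect-int, 10 seconds for ping-int, and
3 minutes for degr-wfc-timeout.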
