[DRBD-user] One of two nodes in DRBD cluster has a strange problem

Rik v. A rikratva at gmail.com
Sun Mar 1 16:33:22 CET 2009



Hi,

I am having a weird problem with two DRBD machines. These machines are
exactly the same in hardware and software and are both running Debian
Etch and DRBD 8.0.14.
These two machines run Heartbeat and the IETD iSCSI target (latest
stable version) in active/passive setup. The primary node is "stor1",
the secondary "stor2".
They have a direct Gigabit connection between each other, dedicated for DRBD.
The problem is that Xen VMs running from this iSCSI target crash on
high disk load, because (presumably) the iSCSI sessions time out.

This only occurs in either of these situations:

- The two machines ("stor1" and "stor2") are both online, so stor1 is
the active node, syncing to stor2.
- stor1 is offline, so stor2 is the active node (and there is no DRBD syncing).

This does not occur when:

- stor2 is offline, so stor1 is the active node (there is no DRBD
syncing between the nodes).

Sometimes, a few seconds before one of these crashes, this error
appears in the syslog, but only on stor1. Note that it does not happen
every time:
drbd0: [drbd0_worker/26335] sock_sendmsg time expired, ko = 3
I am currently running just from stor1, and have had no crashes.
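For what it's worth, my reading of that message (an assumption on my
part, from the drbd.conf documentation of ko-count) is that "ko = 3" is a
countdown from the ko-count value in my config below: each time a send
blocks for a full net timeout the counter drops by one, and at zero the
peer is declared dead. A quick sketch of the worst-case detection time,
assuming DRBD's default net timeout of 60 tenths of a second:

```python
# Hedged sketch: how long a stalled peer could block replication before
# DRBD kicks it, assuming ko-count works as I describe above.
ko_count = 4      # net { ko-count 4; } from my drbd.conf
timeout_ds = 60   # assumed DRBD default net timeout, in tenths of a second

# Each expired sock_sendmsg window decrements the counter by one
# ("ko = 3" in the syslog would be the counter after the first expiry).
worst_case_s = ko_count * timeout_ds / 10.0

print(f"worst-case stall before the peer is kicked: {worst_case_s:.0f} s")
```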

From this I conclude that the problem must lie somewhere in the stor2
machine. I just can't find out where.
Both machines run from brand-name hardware RAID, so the disks can't
be the problem.

The machines each have both Intel Gigabit Ethernet ports and on-board
nVidia Gigabit Ethernet ports.
Letting DRBD sync over the other brand of network interface makes no
difference, and I am running the latest Intel e1000 drivers for those
interfaces, hence I do not think the network is the problem.

Following is my drbd.conf, which is pretty much the default, but I'm
including it just in case.
If any more information is needed, I'd be glad to supply it.

Thanks,
Rik

### /etc/drbd.conf ###

global {
    usage-count yes;
}
common {
}

resource resource0 {
  protocol C;
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    on-io-error   detach;
  }

  net {
    ko-count 4;
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }

  syncer {
    rate 90M;
    al-extents 257;
  }

  on stor1 {
    device     /dev/drbd0;
    disk       /dev/sda9;
    address    192.168.3.121:7788;
    meta-disk  /dev/sda8 [0];
  }

  on stor2 {
    device     /dev/drbd0;
    disk       /dev/sda9;
    address    192.168.3.122:7788;
    meta-disk  /dev/sda8 [0];
  }
}
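As a rough sanity check on the syncer numbers above (my own
back-of-the-envelope arithmetic, not anything from the DRBD docs, and
the link capacity is the theoretical Gigabit maximum before overhead):

```python
# Back-of-the-envelope check of the syncer settings in my drbd.conf.
sync_rate_mb = 90    # syncer { rate 90M; } -> 90 MB/s resync rate
gige_mb = 1000 / 8   # 1 Gbit/s = 125 MB/s, theoretical, before overhead
al_extents = 257     # syncer { al-extents 257; }
extent_mb = 4        # each activity-log extent covers 4 MB (as I understand it)

headroom_mb = gige_mb - sync_rate_mb
al_coverage_gb = al_extents * extent_mb / 1024

print(f"a full-speed resync leaves ~{headroom_mb:.0f} MB/s for live replication")
print(f"the activity log covers ~{al_coverage_gb:.2f} GB of hot data")
```

So during a resync the dedicated link would have little headroom left,
though that alone wouldn't explain the crashes I see with stor2 primary
and no syncing at all.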
