[DRBD-user] strange device problems after failover

Mon Jan 10 17:16:44 CET 2011

Hi there,

Yesterday I did a routine manual failover (switchover) to the second node of a primary/secondary DRBD cluster.

This is the haresources:
filer01 IPaddr::172.16.1.240/24/bond0 IPaddr::172.16.2.240/24/bond0 Delay::1 drbddisk::cluster_metadata drbddisk::vg0drbd Delay::1 LVM::/dev/vg0 Filesystem::/dev/drbd0::/cluster_metadata::ext3::noatime,nodiratime iscsitarget
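
Heartbeat starts the resources of a haresources line from left to right and stops them in reverse order, so the position of LVM::/dev/vg0 shows what has already been started when the scan fails. A minimal shell sketch that lists the resources in start order; the function name list_ha_resources is made up for illustration and is not part of heartbeat:

```shell
# Print the resources of one haresources line, one per line, in start order.
# The first field is the preferred node name and is skipped.
# list_ha_resources is a hypothetical helper, not a heartbeat tool.
list_ha_resources() {
    set -f          # disable globbing while the line is word-split
    set -- $1       # split the line on whitespace into positional parameters
    set +f
    shift           # drop the node name (e.g. filer01)
    for res in "$@"; do
        printf '%s\n' "$res"
    done
}
```

Called with the line above as a single quoted argument, this prints the two IPaddr entries first and LVM::/dev/vg0 only after both drbddisk resources.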

The failover did not succeed because the LVM scan (pvscan/vgscan/lvscan; I don't know which of them heartbeat actually triggers) did not find any PV.
I first checked the LVM cache, deleted it, and double-checked that the configuration is consistent on both nodes, but (at least) a manual pvscan
reported that it could not find any PV signature on my drbd1 (/dev/mapper/vg0drbd in my case). drbd0 is just a flat ext3 filesystem, which worked as expected.
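
To check independently of pvscan whether a PV label is present at all, one could look for the LVM2 label magic LABELONE, which LVM writes into one of the first four 512-byte sectors of a PV. A minimal sketch, assuming 512-byte sectors and a grep that supports -a; has_lvm_label is a hypothetical helper name, and the device has to be readable (for drbd1 that means the node where it is currently Primary):

```shell
# Return 0 if the LVM2 label magic "LABELONE" appears in the first four
# 512-byte sectors of the given device or image file, non-zero otherwise.
# has_lvm_label is a hypothetical helper, not an LVM or DRBD command.
has_lvm_label() {
    # Read only the first 2048 bytes; -a makes grep treat binary data as text.
    dd if="$1" bs=512 count=4 2>/dev/null | grep -qa 'LABELONE'
}
```

As root, something like `has_lvm_label /dev/drbd1 && echo "label found"` would then show whether the label is visible through DRBD on that node.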

DRBD itself remained UpToDate/UpToDate the whole time.

This is a recent /proc/drbd (with the cluster back on the primary, because I needed it up and running again):

version: 8.3.9 (api:88/proto:86-95)
GIT-hash: 1c3b2f71137171c1236b497969734da43b5bec90 build by root@filer01, 2011-01-03 07:01:01
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:156691051 dw:156691051 dr:0 al:0 bm:1640 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
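
The fields that matter in this output are cs (connection state), ro (roles, local/peer) and ds (disk states, local/peer). A small sketch that condenses /proc/drbd to one line per minor, assuming the 8.3-style layout shown above; drbd_summary is a hypothetical name:

```shell
# Summarise DRBD 8.3-style /proc/drbd text read from stdin.
# Prints "minor cs ro ds" for every device line; other lines are ignored.
# drbd_summary is a hypothetical helper, not part of the DRBD utilities.
drbd_summary() {
    awk '
        /^ *[0-9]+: cs:/ {
            minor = $1
            sub(/:$/, "", minor)            # "1:" -> "1"
            cs = "?"; ro = "?"; ds = "?"
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)
                if ($i ~ /^ro:/) ro = substr($i, 4)
                if ($i ~ /^ds:/) ds = substr($i, 4)
            }
            print minor, cs, ro, ds
        }
    '
}
```

It would typically be run as `drbd_summary < /proc/drbd` on each node, which makes it easy to compare roles and disk states side by side.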

On the currently active node, pvscan has no problem bringing up LVM on top of drbd1.

The previous failover test succeeded. In between, we removed the LUN that serves as the backing store of the
(now) problematic cluster member and recreated it with a different stripe size (on hardware RAID),
but with all other parameters identical. It has the same byte size and the same block size as before.
After the LUN had been set up, we invalidated it and let DRBD resync it.

Could anyone shed some light on what could be wrong? Thanks!

Here's the drbd.conf:

global {
 # minor-count 64;
 # dialog-refresh 5; # 5 seconds
 # disable-ip-verification;
 usage-count ask;
}

common {
 syncer { rate 368M; }
}

resource cluster_metadata {
 protocol C;
 handlers {
  pri-on-incon-degr "echo O > /proc/sysrq-trigger ; halt -f";
  pri-lost-after-sb "echo O > /proc/sysrq-trigger ; halt -f";
  local-io-error "echo O > /proc/sysrq-trigger ; halt -f";
  # outdate-peer "/usr/sbin/drbd-peer-outdater";
 }

 startup {
  # wfc-timeout 0;
  degr-wfc-timeout 120; # 2 minutes.
  outdated-wfc-timeout 90;
 }

 disk {
  on-io-error detach;
 }

 net {
  after-sb-0pri disconnect;
  after-sb-1pri disconnect;
  after-sb-2pri disconnect;
  rr-conflict disconnect;
 }

 syncer {
  # rate 10M;
  # after "r2";
  al-extents 3389;
 }

 on filer01 {
  device /dev/drbd0;
  disk /dev/sda4;
  address 192.168.192.1:7788;
  meta-disk internal;
 }

 on filer02 {
  device /dev/drbd0;
  disk /dev/sda4;
  address 192.168.192.2:7788;
  meta-disk internal;
 }
}

resource vg0drbd {
 protocol C;
 startup {
  wfc-timeout 0; ## Infinite!
  degr-wfc-timeout 120; ## 2 minutes.
  outdated-wfc-timeout 90;
 }

 disk {
  no-disk-barrier; ## ONLY WITH BBU!
  no-disk-flushes; ## ONLY WITH BBU!
  no-disk-drain;
  on-io-error detach;
 }

 net {
  # timeout 60;
  # connect-int 10;
  # ping-int 10;
  max-buffers 8000;
  max-epoch-size 8000;
  sndbuf-size 512k;
 }

 syncer {
  after "cluster_metadata";
  al-extents 3389;
 }

 on filer01 {
  device /dev/drbd1;
  disk /dev/sdb;
  address 192.168.192.1:7789;
  meta-disk internal;
 }

 on filer02 {
  device /dev/drbd1;
  disk /dev/sdb;
  address 192.168.192.2:7789;
  meta-disk internal;
 }
}

Kind regards

--
Stephan Seitz
Senior System Administrator

   netz-haut GmbH
   multimediale kommunikation

   Zweierweg 22
   97074 Würzburg

   Phone: 0931 2876247
   Fax: 0931 2876248

   Web: www.netz-haut.de
   Amtsgericht Würzburg - HRB 10764
   Managing Directors: Michael Daut, Kai Neugebauer