[DRBD-user] [Fwd: Possible reasons for random reboots?]

Mon Aug 1 13:58:55 CEST 2011

Apologies for possible duplication but it seems I have some email 
configuration problems so the last attempt may not have reached the list...

Hi all

I have a pair of servers running DRBD on CentOS 5 that had been 
chugging along quite happily for about 10 months until yesterday, when 
they both rebooted themselves within 10 seconds of each other. Both were 
running CentOS 5.4 and DRBD 8.2.6 from the CentOS-extras repository. 
Following this crash, and to fix another problem, last night I migrated 
the Xen VMs off one of the pair, yum updated that server to CentOS 5.6 
and DRBD 8.3.10 (the latest for which I can find an RPM) and migrated 
the VMs back again. The other server of the pair is untouched: no 
updates to either the OS or DRBD.

The servers are HP DL360 G7s with dual six-core 2.6GHz CPUs, 12GB RAM 
and HP P410i RAID controllers with battery-backed cache. They are used 
for running Xen VMs from LVM volumes (DRBD on top of LVM). There are 
5 VMs in total with 7 DRBD devices altogether. Under normal 
circumstances, 2 VMs run on one server and 3 on the other.

Today they did it again, and then several more times: about every 20 
minutes, in fact. The servers are in a remote data centre and I have no 
console access. The iLOs on these two servers are not set up, so I am 
unable to use them and can see no output on the console. There's no 
information in /var/log about what the problem is. All I see is that one 
of the servers reboots itself and then, 5 to 10 seconds later, the second 
one follows it. The logs show that it's not always the same server that 
reboots first; sometimes it's one and sometimes the other. The only way 
I've managed to get the servers out of their 20-minute reboot loop is to 
stop drbd on one of the pair and migrate all my VMs to run on the other, 
with all the DRBD devices in standalone mode. This seems to me to 
indicate that DRBD is most probably involved in the reboots.

There are 2 other servers in the same rack and attached to the same 
power bar so I do not think this is power related.

Long-winded introduction to the question: what circumstances could cause 
DRBD to initiate a reboot of first one server and then the other? Apart 
from the various resource definitions, /etc/drbd.conf is very minimal and 
lets a lot of settings take their defaults. I added the 
allow-two-primaries and cram-hmac-alg settings last night so that I could 
migrate the VMs from one server to the other, so they post-date the start 
of the problems.

global {
  minor-count 64;
  usage-count yes;
}

common {
  syncer { rate 90M; }
  startup {
    degr-wfc-timeout 120;
    wfc-timeout 0;
  }
  protocol C;
  net {
    allow-two-primaries;
    cram-hmac-alg md5;
    shared-secret "my secret key";
  }
}

resource dns3-disk {
  device     /dev/drbd1;
  meta-disk  internal;
  on xen23 {
    disk       /dev/xen23/dns3-disk;
    address    123.123.123.123:7781;
  }
  on xen24 {
    disk       /dev/xen24/dns3-disk;
    address    123.123.123.124:7781;
  }
}
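
To make the question more concrete: the only kind of setting I know of 
that could make DRBD itself reboot a node is a handlers section along 
the lines of the sketch below. This is purely illustrative. The handler 
names are the ones I understand drbd.conf to support, the commands are 
examples only, and nothing like this appears in the configuration shown 
above. I don't know whether anything equivalent could be active from 
another file or from defaults.

```
# Hypothetical example only; NOT part of my /etc/drbd.conf.
# A section like this would normally sit inside common { } or a resource.
handlers {
  # Called when a degraded primary finds its local data inconsistent.
  pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
  # Called when a primary loses the after-split-brain auto recovery.
  pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
  # Called when DRBD gets an I/O error from the local disk.
  local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
}
```

If handlers like these were in effect, I would expect the first two to 
force an immediate reboot and the third to power the node off, which is 
why I mention them.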

Any clues and/or help appreciated!

Thanks.

Trevor