Apologies for possible duplication, but it seems I have some email
configuration problems, so the last attempt may not have reached the list.
Hi all
I have a pair of servers running DRBD on CentOS 5 that had been
chugging along quite happily for about 10 months until yesterday, when
they both rebooted themselves within 10 seconds of each other. They were
running CentOS 5.4 and DRBD 8.2.6 from the CentOS-extras repository.
Following this crash, and to fix another problem, I migrated the
Xen VMs from one of the pair to the other last night, yum updated the
first server to CentOS 5.6 and DRBD 8.3.10 (the latest for which I can
find an RPM), and migrated its VMs back again. The other server of the
pair is untouched: no updates to either the OS or DRBD. The servers are
HP DL360 G7s with dual hex-core 2.6GHz CPUs, 12GB RAM and HP P410i RAID
controllers with battery-backed cache, and are used for running Xen VMs
from LVM volumes (DRBD on top of LVM). There are 5 VMs in total with 7
DRBD devices altogether. Under normal circumstances, 2 VMs run on one
server and 3 on the other.
Today they did it again, and then several more times: about every 20
minutes, in fact. The servers are in a remote data centre; I have no
console access, and the iLOs on these two servers are not set up, so I
cannot see any console output. There's no information in /var/log about
what the problem is; all I see is that one of the servers reboots itself
and then, 5 to 10 seconds later, the second one follows it. The logs
show that it's not always the same one that reboots first; sometimes
it's one and sometimes the other. The only way I've managed to get the
servers out of their 20-minute reboot loop is to stop DRBD on one of the
pair and migrate all my VMs to run on the other, with all the DRBD
devices in standalone mode. This seems to me to indicate that DRBD is
most probably involved in the reboots.
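For anyone wanting to check the ordering themselves, here is a minimal
sketch (Python) of how boot times from the two servers could be paired
up to see which one went first and by how long. The function name, the
60-second window and the timestamps are all made up for illustration;
the real times would have to be pulled from each server's logs by hand,
and nothing here is DRBD-specific.

```python
from datetime import datetime, timedelta


def pair_reboots(boots_a, boots_b, name_a="xen23", name_b="xen24",
                 window=timedelta(seconds=60)):
    """Pair each boot of server A with the nearest boot of server B.

    Returns a list of (first_host, gap_seconds) tuples, one per boot of A
    that has a boot of B within `window`. Unmatched boots are skipped.
    """
    pairs = []
    for a in boots_a:
        if not boots_b:
            break
        # Nearest boot on the other server, in either direction.
        nearest = min(boots_b, key=lambda b: abs(b - a))
        gap = (nearest - a).total_seconds()
        if abs(gap) <= window.total_seconds():
            # A positive gap means A's timestamp is the earlier one.
            first = name_a if gap >= 0 else name_b
            pairs.append((first, abs(gap)))
    return pairs


# Hypothetical timestamps, not taken from the real logs.
xen23_boots = [datetime(2011, 1, 1, 10, 0, 0), datetime(2011, 1, 1, 10, 20, 8)]
xen24_boots = [datetime(2011, 1, 1, 10, 0, 7), datetime(2011, 1, 1, 10, 20, 0)]

for host, gap in pair_reboots(xen23_boots, xen24_boots):
    print(f"{host} went first, other followed after {gap:.0f}s")
```

A consistent gap with no fixed leader would match what I described
above; it would not by itself say what triggers the first reboot.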
There are 2 other servers in the same rack and attached to the same
power bar so I do not think this is power related.
Long-winded introduction to the question: what circumstances could cause
DRBD to initiate a reboot of first one server and then the other? Apart
from the various resource definitions, /etc/drbd.conf is very minimal
and leaves a lot of settings at their defaults. I added the
allow-two-primaries and cram-hmac-alg settings last night so that I
could migrate the VMs from one server to the other, so they post-date
the start of the problems.
global {
    minor-count 64;
    usage-count yes;
}
common {
    syncer { rate 90M; }
    startup {
        degr-wfc-timeout 120;
        wfc-timeout 0;
    }
    protocol C;
    net {
        allow-two-primaries;
        cram-hmac-alg md5;
        shared-secret "my secret key";
    }
}
resource dns3-disk {
    device /dev/drbd1;
    meta-disk internal;
    on xen23 {
        disk /dev/xen23/dns3-disk;
        address 123.123.123.123:7781;
    }
    on xen24 {
        disk /dev/xen24/dns3-disk;
        address 123.123.123.124:7781;
    }
}
Any clues and/or help appreciated!
Thanks.
Trevor