Apologies for possible duplication but it seems I have some email configuration problems so the last attempt may not have reached the list...

Hi all,

I have a pair of servers running DRBD on CentOS 5 that have been chugging along quite happily for about 10 months until yesterday when they both rebooted themselves within 10 seconds of each other. They were running CentOS 5.4 and DRBD 8.2.6 from the CentOS-extras repository but following on from this crash and to fix another problem, I migrated the xen VMs from one of the pair to the other last night, yum updated to CentOS 5.6 and DRBD 8.3.10 (the latest for which I can find an RPM) and migrated the VMs back again. The other one of the pair of servers is untouched - no updates to either o/s or DRBD.

The servers are HP DL360 G7's with dual hex core 2.6GHz chips and 12GB RAM, HP P410i RAID controllers with battery backed cache and are used for running xen VMs from LVM volumes (DRBD on top of LVM). There are 5 VMs in total with 7 DRBD devices altogether. Under normal circumstances, 2 VMs run on one server and 3 on the other.

Today they did it again. And then several more times - about every 20 minutes in fact.

The servers are in a remote data centre and I have no console access and the iLO's on these two servers are not set up and I'm unable to use them so I can see no output on the console. There's no information in /var/log about what the problem is, all I see is that one of the servers reboots itself and then 5 to 10 seconds later, the 2nd one follows it. I've seen from the logs that it's not always the same one that reboots first, sometimes it's one and sometimes the other.

The only way I've managed to get the servers out of their 20 minute reboot loop is to stop drbd on one of the pair and migrate all my VMs to run on the other with all the DRBD devices in standalone mode. This seems to me to indicate that DRBD is most probably involved in the reboot.
There are 2 other servers in the same rack and attached to the same power bar so I do not think this is power related.

Long winded introduction to the question: what circumstances could cause DRBD to initiate a reboot of first one server then the other?

Apart from the various resource definitions, /etc/drbd.conf is very minimal and lets a lot of settings take defaults. The allow-two-primaries and cram-hmac-alg settings were added by me last night so that I could migrate the VMs from one server to the other, so they post-date the start of the problems.

global { minor-count 64; usage-count yes; }

common {
  syncer { rate 90M; }
  startup {
    degr-wfc-timeout 120;
    wfc-timeout 0;
  }
  protocol C;
  net {
    allow-two-primaries;
    cram-hmac-alg md5;
    shared-secret "my secret key";
  }
}

resource dns3-disk {
  device /dev/drbd1;
  meta-disk internal;
  on xen23 {
    disk /dev/xen23/dns3-disk;
    address 123.123.123.123:7781;
  }
  on xen24 {
    disk /dev/xen24/dns3-disk;
    address 123.123.123.124:7781;
  }
}

Any clues and/or help appreciated!

Thanks.

Trevor