We don't have fencing configured. This pair has no Pacemaker or anything like that; it's purely manual failover. IIRC we suspected that the fencing handlers may have been causing the very occasional reboots we had seen, so we disabled the reboot calls in the fencing config. The handlers currently look like this:

    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh;";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh;";
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh;";

... but we don't see any notifications or anything logged to suggest fencing has been called.

On 11 January 2012 17:15, Digimer <linux at alteeve.com> wrote:
> On 01/11/2012 11:56 AM, Adam Wilbraham wrote:
>> Anyway, we moved the servers to a new location over the weekend and
>> almost from the point of power up this parallel reboot issue reared
>> its head. If the servers were sat there idling they would be fine, but
>> the minute I started to boot up domUs I ran the risk of it
>> happening. Normally the more domUs running, the more likely it was to
>> kick a reboot. It seemed most likely to happen when
>> starting another domU rather than while just running the online
>> VMs. Anyway, I eventually narrowed this down to a point where I
>> realised that if I unplugged the network cable that was being used for
>> DRBD replication then the server would spew output to the screen and
>> reboot instantly, with the other one in the pair going about a second
>> later.
>
> If you have fencing configured, as you should, then you could have been
> seeing a dual-fence problem. Basically, both nodes send off their kill
> commands before one of them dies. I believe this is a known issue with
> some iLO based fencing, but I can't quote a source. Generally, the test is
> to put a 5sec delay into the fence script on one node. If it then
> reliably dies first and the other node lives, you found the issue.
>
> You *are* using fencing, right? ;)
>
> --
> Digimer
> E-Mail: digimer at alteeve.com
> Freenode handle: digimer
> Papers and Projects: http://alteeve.com
> Node Assassin: http://nodeassassin.org
> "omg my singularity battery is dead again.
> stupid hawking radiation." - epitron

--
Adam Wilbraham
Senior Systems Administrator
Technophobia Ltd, Velocity House, 3 Solly Street, Sheffield, S1 4DE
t: +44 (0)114 2212123
e: adam.wilbraham at technophobia.com
w: http://www.technophobia.com/
http://twitter.com/WeTechnophobia
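
For readers following the dual-fence explanation quoted above, here is a minimal sketch of the race as a toy timing model. It is not DRBD or iLO code: the function name `survivors`, the node labels "a" and "b", and the single `kill_latency` value are all assumptions made for illustration. It only shows why both nodes can die when they fire their kill commands at the same moment, and why a delay on one node's fence script changes the outcome.

```python
def survivors(delay_a, delay_b, kill_latency):
    """Toy model of two nodes fencing each other after a link loss at t=0.

    delay_a / delay_b: seconds each node waits before firing its fence
    command (hypothetical parameters, e.g. a sleep in a fence script).
    kill_latency: seconds between firing a fence command and the peer
    actually going down (assumed equal for both nodes).
    Returns the sorted list of nodes still alive at the end.
    """
    fire_time = {"a": delay_a, "b": delay_b}
    death_time = {}  # node -> time at which it is killed

    # Walk the nodes in the order in which they would fire.
    for node in sorted(fire_time, key=fire_time.get):
        peer = "b" if node == "a" else "a"
        # A node that is already dead by its fire time never fires.
        if node in death_time and death_time[node] <= fire_time[node]:
            continue
        death_time[peer] = fire_time[node] + kill_latency

    return sorted(n for n in fire_time if n not in death_time)


if __name__ == "__main__":
    # No delay: both kill commands are in flight before either lands.
    print(survivors(0, 0, 1))
    # 5 second delay on node b: node a fences first and survives.
    print(survivors(0, 5, 1))
```

In this model, the undelayed case leaves no survivors, while delaying one node means the other reliably fences first and stays up, which matches the diagnostic described in the quoted message.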