I've spent the past couple of days trying to get a pair of servers into a stable state, and thought I would get my issues down in a post (and possible bug report) whilst they are still in my head. I'm probably going to miss some information out, but I'll write down what I remember. It's been a busy couple of days, and I was against the clock because I was blocking colleagues, so I didn't log as much as I should have while troubleshooting. The combinations I tried may mean that none of this is useful for debugging.

I started out with a pair of HP DL360 G6s which were built approximately a year ago, running Debian Squeeze (before it became stable) with Xen 4.0 on top, all from Apt, and with DRBD 8.3.10 built from source (I believe this was stable at the time). The pair have only been used as internal development hosts and were never patched up when Debian went stable, so the kernel (xen-linux-system-2.6.32-5-xen-amd64_2.6.32-30_amd64.deb) and other packages were a little out of date.

Over the last year the pair have been stable almost all of the time, but we did have a couple of incidents where both servers would reboot in tandem. Because they weren't business critical, resolving this was never a priority.

Anyway, we moved the servers to a new location over the weekend, and almost from the point of power up this parallel reboot issue reared its head. If the servers sat idle they were fine, but the minute I started booting domUs I ran the risk of it happening. Generally, the more domUs were running, the more likely a reboot was. It seemed most likely to happen when starting another domU, rather than while the already-running VMs were just ticking over.
I eventually narrowed this down: if I unplugged the network cable used for DRBD replication, the server would spew output to the screen and reboot instantly, with the other one in the pair following about a second later.

My first thought was that I was massively out of date on patches, so I ran apt-get update & apt-get dist-upgrade. This brought a new kernel with it (xen-linux-system-2.6.32-5-xen-amd64_2.6.32-38_amd64.deb). At the same time, I think I decided it was probably wise to go to the latest DRBD (8.4.1), so I built the module and tools and off I went.

This brought an end to the random reboots, but it also brought new problems, which seemed to suggest that Xen could no longer properly access the disk subsystem exposed paravirtually to a domU if that domU was using the Debian Squeeze kernel. On startup, the kernel within the domU would report that processes had been blocked for 120 seconds, and the boot never completed. At the same time, I was seeing kernel oops messages on the console with a large hex dump. There may be some stuff of relevance in my kern.log - I'll have to have a fish around. For reference, domUs running Etch and Lenny kernels didn't exhibit this problem and booted fine.

I did some looking around and found various references to bugs in the current Squeeze kernel, and suggestions to try the one from proposed updates (2.6.32-40). Unfortunately, this didn't make any difference to my problem.

As this was all looking kernel related so far, I went looking for a newer prebuilt kernel to try. First I pulled linux-image-2.6.39-bpo.2-amd64_2.6.39-3~bpo60+1_amd64.deb from Squeeze backports; however, this doesn't have blkback support, so it's no good as a dom0 kernel.
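For anyone wanting to fish through a kern.log for the two symptoms described above (tasks blocked for 120 seconds, and oops output), a minimal Python sketch follows. The regular expressions are assumptions based on the usual wording of the kernel's hung-task and oops messages, and the function name `scan_kernel_log` is purely illustrative; nothing here is taken from the logs on these servers.

```python
import re

# Assumed message shapes (not copied from the affected hosts):
#   "INFO: task <name>:<pid> blocked for more than <N> seconds."
#   lines containing "Oops", "BUG" or "general protection fault"
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)
OOPS = re.compile(r"\b(Oops|BUG|general protection fault)\b")


def scan_kernel_log(lines):
    """Return (hung_tasks, oops_lines) found in an iterable of log lines.

    hung_tasks is a list of (task_name, pid, seconds) tuples;
    oops_lines is a list of the raw matching lines.
    """
    hung_tasks = []
    oops_lines = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            name, pid, seconds = match.groups()
            hung_tasks.append((name, int(pid), int(seconds)))
        elif OOPS.search(line):
            oops_lines.append(line.rstrip("\n"))
    return hung_tasks, oops_lines
```

It could be pointed at a log with something like `scan_kernel_log(open("/var/log/kern.log", errors="replace"))`. The hung-task messages would be in the domU's own log, while any oops from the host side would be in the dom0's.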
I then grabbed linux-image-3.1.0-1-amd64_3.1.6-1_amd64.deb from Wheezy / Testing and found that this works absolutely fine for me. Because it is built with gcc-4.6, I'm not in a position to build DRBD from source without another chunk of work, so for the quickest fix I reverted to the in-kernel version (8.3.11) and grabbed a matching tools deb (albeit from Ubuntu). Lo and behold, I appear to have reached a point of stability.

So to sum up - this pair is now running Squeeze (including all proposed updates) + Wheezy kernel + drbd8-utils_2%3a8.3.7-2.1_amd64.deb from Ubuntu, and I finally seem to have reached a working state again.

-- 
Adam Wilbraham
Senior Systems Administrator
Technophobia Ltd, Velocity House, 3 Solly Street, Sheffield, S1 4DE
t: +44 (0)114 2212123
e: adam.wilbraham at technophobia.com
w: http://www.technophobia.com/