[DRBD-user] Tales of woe - possible Debian Squeeze kernel bugs, possible DRBD bugs, possible Xen bugs...

Adam Wilbraham adam.wilbraham at technophobia.com
Wed Jan 11 17:56:01 CET 2012


I've spent the past couple of days trying to get a pair of servers
into a stable state, and thought I would get my issues down in a post
and possible bug report whilst they are still in my head. I'm probably
going to miss some bits of information out, but I'll try to get down
what I remember. It's been a busy couple of days and I was blocking
colleagues, so I was working against the clock and didn't log as much
as I should have during troubleshooting. Given the number of
combinations I've tried, that may mean none of this is useful for
debugging.

I started out with a pair of HP DL360 G6s which were built
approximately a year ago, running Debian Squeeze (before it became
stable) with Xen 4.0 on top, all from Apt, and with DRBD 8.3.10 built
from source (I believe this was stable at the time). The pair of
servers have only been used as internal development hosts and were
never patched up when Debian went stable, so the kernel version was a
little out of date
(xen-linux-system-2.6.32-5-xen-amd64_2.6.32-30_amd64.deb), as were
other packages. Over the last year the pair have been stable almost
all of the time, but we did have a couple of incidents where the pair
would reboot in tandem. Because they weren't business critical,
resolving this was never a priority.

Anyway, we moved the servers to a new location over the weekend and
almost from the point of power up this parallel reboot issue reared
its head. If the servers were sat there idling they would be fine, but
the minute I started to boot up domUs I ran the risk of it
happening. Generally, the more domUs running, the more likely it was
to trigger a reboot. It seemed most likely to happen when starting
another domU rather than while just running the VMs that were already
online. Anyway, I eventually narrowed this down to a point where I
realised that if I unplugged the network cable that was being used for
DRBD replication then the server would spew output to the screen and
reboot instantly, with the other one in the pair going about a second
later.

The first thing I thought here was that I'm massively out of date on
patches, so let's apt-get update & apt-get dist-upgrade. This brought
a new kernel with it
(xen-linux-system-2.6.32-5-xen-amd64_2.6.32-38_amd64.deb), and at the
same time I think I decided it was probably wise to go to the latest
DRBD (8.4.1), so I built the module and tools and off I went. This
brought an end to the random reboots, but it also brought new problems
which seemed to suggest that Xen could no longer properly access the
disk subsystem being exposed paravirtually into a domU if the domU was
using the Debian Squeeze kernel. On startup I would get error messages
from the kernel within the domU saying that processes had been blocked
for 120 seconds, and the boot never completed. At the same time, I was
seeing kernel oops messages on the console with a large hex string
being dumped out. There may be some stuff of relevance in my kern.log
here actually - I'll have to have a fish around. For reference, domUs
which I had running with Etch and Lenny kernels didn't exhibit this
problem and booted fine.

I did some looking around and found various references to bugs in the
current Squeeze kernel, and suggestions to try the one from proposed
updates (2.6.32-40). Unfortunately, this didn't make any difference to
my problem.

As this was so far all looking kernel related, I went looking for a
newer prebuilt kernel to try. First of all I pulled
linux-image-2.6.39-bpo.2-amd64_2.6.39-3~bpo60+1_amd64.deb from Squeeze
backports; however, this doesn't have blkback support, which means
it's no good as a dom0 kernel. I then went and grabbed
linux-image-3.1.0-1-amd64_3.1.6-1_amd64.deb from Wheezy / Testing and
found that this works absolutely fine for me. Due to it being built
with gcc-4.6 I'm not in a position to build DRBD from source without
another chunk of work, so for the quickest fix I reverted to the
in-kernel version (8.3.11) and grabbed a matching tools deb (albeit
from Ubuntu), and lo and behold I appear to have reached a point of
stability.

So to sum up - this pair is now running Squeeze (including all
proposed updates) + Wheezy kernel +
drbd8-utils_2%3a8.3.7-2.1_amd64.deb from Ubuntu, and I finally seem to
have reached a working state again.
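
For anyone wanting to sanity-check a similar mix of in-kernel module
and separately packaged tools, here is a rough sketch of comparing the
two version numbers. It is only an illustration, not something run on
these servers: the helper names parse_version and same_series are made
up for this example, and the sample line is modelled on the first line
of /proc/drbd (the api/proto values in brackets are illustrative). It
only checks that both sides are in the same major.minor series, which
is an assumed rule of thumb rather than a documented compatibility
guarantee.

```python
import re


def parse_version(text):
    """Return the first dotted x.y.z version found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def same_series(module_version, tools_version):
    """True when both versions share major.minor (e.g. both are 8.3.x)."""
    return module_version[:2] == tools_version[:2]


# Sample inputs: a line in the style of /proc/drbd (bracketed values are
# illustrative) and the tools package file name quoted in this post.
proc_drbd_line = "version: 8.3.11 (api:88/proto:86-96)"
tools_package = "drbd8-utils_2%3a8.3.7-2.1_amd64.deb"

module_version = parse_version(proc_drbd_line)
tools_version = parse_version(tools_package)

if same_series(module_version, tools_version):
    print("module and tools are in the same series")
else:
    print("module and tools are from different series")
```

On a live system the first argument would come from reading the first
line of /proc/drbd rather than from a hard-coded string.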


-- 
Adam Wilbraham
Senior Systems Administrator

Technophobia Ltd, Velocity House, 3 Solly Street, Sheffield, S1 4DE

t: +44 (0)114 2212123
e: adam.wilbraham at technophobia.com
w: http://www.technophobia.com/
http://twitter.com/WeTechnophobia



