[DRBD-user] Tales of woe - possible Debian Squeeze kernel bugs, possible DRBD bugs, possible Xen bugs...

Adam Wilbraham adam.wilbraham at technophobia.com
Thu Jan 12 11:51:41 CET 2012



We don't have fencing configured; this pair has no Pacemaker or
anything like that - it's purely manual failover. IIRC we suspected
that the fencing handlers may have been causing the very occasional
reboots we had seen, so we disabled the reboot calls in the fencing
config. The handlers section currently looks like this:

    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh;";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh;";
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh;";

... but we don't see any notifications or anything logged to suggest
fencing has been called.
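
One way to confirm whether a handler ever fires would be to wrap each handler command so that it leaves a timestamped line on disk before doing anything else. This is only a rough sketch: the function name log_and_run and the LOG_FILE path are made up for illustration and are not part of DRBD.

```shell
#!/bin/sh
# Hypothetical wrapper: record that a DRBD handler fired, then run the real command.
# LOG_FILE and log_and_run are assumed names, not anything DRBD ships.
LOG_FILE="${LOG_FILE:-/var/log/drbd-handlers.log}"

log_and_run() {
    handler_name="$1"
    shift
    # Timestamped line so an invocation can be matched against a reboot time.
    printf '%s handler=%s cmd=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$handler_name" "$*" >> "$LOG_FILE"
    # Flush to disk first, since the real handler may reboot the machine.
    sync
    # Run the original handler command unchanged.
    "$@"
}

# Example (commented out so sourcing this file has no side effects):
# log_and_run pri-on-incon-degr /usr/lib/drbd/notify-pri-on-incon-degr.sh
```

If the log stays empty across one of the parallel reboots, that would suggest the handlers are not what is bringing the nodes down.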


On 11 January 2012 17:15, Digimer <linux at alteeve.com> wrote:
> On 01/11/2012 11:56 AM, Adam Wilbraham wrote:
>> Anyway, we moved the servers to a new location over the weekend and
>> almost from the point of power up this parallel reboot issue reared
>> its head. If the servers were sat there idling they would be fine, but
>> the minute I started to boot up domUs I ran the risk of it
>> happening. Normally the more domUs running, the more likely it was to
>> kick off a reboot. It seemed most likely to happen when starting
>> another domU rather than while just running the VMs already online.
>> Anyway, I eventually narrowed this down to a point where I
>> realised that if I unplugged the network cable that was being used for
>> DRBD replication then the server would spew output to the screen and
>> reboot instantly, with the other one in the pair going about a second
>> later.
>
> If you have fencing configured, as you should, then you could have been
> seeing a dual-fence problem. Basically, both nodes send off their kill
> commands before either of them dies. I believe this is a known issue with
> some iLO based fencing, but I can't quote a source. Generally, the test is
> to put a 5sec delay into the fence script on one node. If it then
> reliably dies first and the other node lives, you found the issue.
>
> You *are* using fencing, right? ;)
>
> --
> Digimer
> E-Mail:              digimer at alteeve.com
> Freenode handle:     digimer
> Papers and Projects: http://alteeve.com
> Node Assassin:       http://nodeassassin.org
> "omg my singularity battery is dead again.
> stupid hawking radiation." - epitron
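
For reference, the delay test Digimer describes above could be sketched roughly like this, for a setup that does have a fence script. The names fence_with_delay, FENCE_DELAY_NODE and FENCE_DELAY_SECS are invented for the example; the real fence command would be whatever the cluster already calls.

```shell
#!/bin/sh
# Hypothetical wrapper around an existing fence command.
# fence_with_delay, FENCE_DELAY_NODE and FENCE_DELAY_SECS are assumed names.
FENCE_DELAY_SECS="${FENCE_DELAY_SECS:-5}"

fence_with_delay() {
    # Only the node whose hostname matches FENCE_DELAY_NODE waits.
    # Its peer gets to send the kill command first, so the waiting node
    # should be the one that reliably dies if this is a dual-fence race.
    if [ "$(uname -n)" = "${FENCE_DELAY_NODE:-}" ]; then
        sleep "$FENCE_DELAY_SECS"
    fi
    # Run the real fence command unchanged.
    "$@"
}

# Example (commented out; the fence command path is a placeholder):
# FENCE_DELAY_NODE=node1 fence_with_delay /path/to/real-fence-command
```

If the delayed node then consistently goes down first while the other stays up, that points at the dual-fence race rather than at DRBD or the kernel.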



-- 
Adam Wilbraham
Senior Systems Administrator

Technophobia Ltd, Velocity House, 3 Solly Street, Sheffield, S1 4DE

t: +44 (0)114 2212123
e: adam.wilbraham at technophobia.com
w: http://www.technophobia.com/
http://twitter.com/WeTechnophobia

Part of the Capita Group: www.capita.co.uk

Registered in England and Wales Company No. 3063669
VAT registration No. 618 1841 40
ISO 9001:2000 Accredited Company No. 21227
ISO 14001:2004 Accredited Company No. E997
ISO 27001:2005 (BS7799) Accredited Company No. IS 508906
Investor in People Certified No. 101507



