[DRBD-user] "PingAck not received" messages

Florian Haas florian at hastexo.com
Thu May 24 15:32:46 CEST 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Thu, May 24, 2012 at 3:09 PM, Matthew Bloch <matthew at bytemark.co.uk> wrote:
>>> We are preparing to jump to a 2.6.32 sourced from CentOS because this
>>> Debian
>>> kernel seems to crash with one bug or another every few months.
>>
>> That would seem like an odd thing to do. FWIW, we've been running
>> happily on squeeze kernels for months.
>
> Then you've not hit the "scheduler divide by zero" bug or the "I/O frozen
> for 120s for no reason" bug or the "CPU#x stuck for 9999999s" bug?  These
> are all things that are filed vaguely on the Redhat bug trackers, as far as
> I know, and usually closed a few kernel versions later with "well I haven't
> seen it for a few kernel versions so it's probably OK"!

I've seen the "I/O frozen for no reason" problem which seems to be an
upstream XFS issue, which Debian is hardly to blame for. The others I
personally haven't encountered. Just for clarification, what seemed
odd to me was not that you would update off the Debian stock squeeze
kernel, but that you'd consider pulling a CentOS kernel, of ostensibly
the same kernel version, into a Debian system. I'd just go to the
current Debian backports kernel. But we're going off topic. :)

> These are relatively rare bugs, except for some of our customers, for whom
> they're not at all rare, and we haul them up to e.g. whatever wheezy has.
>  Except in this case they broke the bridging code in 3.2.0 which is going to
> cause a virtualising customer some problems :-)

Indeed.

> Previously the disconnects happened several times a day, not just when the
> backups ran - this is a separate issue from the one I asked about while
> still being relevant to the list.
>
> Arguably a customer running a heavily interactive system to very remote
> destinations shouldn't be using such a complex I/O stack and should use
> dedicated hardware.  This is a pragmatic, expensive, unambitious argument
> :-)  But drbd+LVM has worked very well for them for 18 months, and the peace
> of mind of being able to start their customers' VMs in one of two places
> makes diagnosing this properly worth the effort.

It would still be interesting to find out whether you ever saw these
random disconnects with protocol C, or whether this appears to be a
B-only issue.
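
In case it helps, here is a minimal sketch of what testing that would look
like, using made-up resource, host and volume names and DRBD 8.3-style
syntax. The ping-int/ping-timeout values are just the defaults made explicit;
those two knobs are what produce the "PingAck not received" message, so this
is an illustration, not a tuning recommendation:

  resource r0 {
    protocol C;               # switch from B to fully synchronous replication
    net {
      ping-int     10;        # seconds between keep-alive pings (default 10)
      ping-timeout  5;        # tenths of a second to wait for PingAck (default 5 = 500 ms)
    }
    on node-a {
      device    /dev/drbd0;
      disk      /dev/vg0/customer_lv;
      address   192.0.2.1:7788;
      meta-disk internal;
    }
    on node-b {
      device    /dev/drbd0;
      disk      /dev/vg0/customer_lv;
      address   192.0.2.2:7788;
      meta-disk internal;
    }
  }

After changing the config on both nodes, "drbdadm adjust r0" should pick up
the net options; the protocol switch itself may need an explicit
disconnect/connect cycle on the resource.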

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now


