Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 24/05/12 13:54, Florian Haas wrote:
> On Thu, May 24, 2012 at 1:53 PM, Matthew Bloch <matthew at bytemark.co.uk> wrote:
>> Hmm, thanks. Unrelated to any of this, the v3a kernel (Debian 2.6.32-4)
>> crashed pretty badly 48hrs ago. Since it has been rebooted, there have
>> been no "PingAck not received" messages.
>
> Sure, if you have kernel-induced network problems on one of your
> nodes, that would definitely explain the issues you're seeing. But you
> insisted from the start that there were no network issues. :)

No indeed - nothing external that we could detect after hours of layer 2
tracing, and no messages that would indicate a malfunction on either of
the hosts. But this network problem was only visible through DRBD's
messages, and now that it's gone it's hard to reason about it any further
(not that I miss it). As I said, I couldn't see any symptoms via ICMP or
TCP-based tests between the hosts.

>> We are preparing to jump to a 2.6.32 sourced from CentOS because this
>> Debian kernel seems to crash with one bug or another every few months.
>
> That would seem like an odd thing to do. FWIW, we've been running
> happily on squeeze kernels for months.

Then you've not hit the "scheduler divide by zero" bug, or the "I/O
frozen for 120s for no reason" bug, or the "CPU#x stuck for 9999999s"
bug? These are all filed vaguely on the Red Hat bug trackers, as far as
I know, and usually closed a few kernel versions later with "well, I
haven't seen it for a few kernel versions so it's probably OK"! They are
relatively rare bugs - except for some of our customers, for whom they
aren't rare at all, and whom we haul up to e.g. whatever kernel wheezy
has. Except in this case that broke the bridging code in 3.2.0, which is
going to cause a virtualising customer some problems :-)

>> The reason we're using external meta-devices is for backup: without the
>> metadata at the end, the underlying disk image represents exactly what
>> the VMs see. We can then snapshot this and take a reasonably consistent
>> backup without bothering DRBD. We later verify this backup by booting it
>> back up, disconnected, and taking a snapshot of the VNC console!
>
> You can always do that from a device with metadata as well. kpartx is
> your friend.

Sure, but we don't pay any penalty for doing it externally either. It's
all on LVM and proper battery-backed RAID.

>> The reason I picked protocol B is because LVM snapshots kill the local
>> DRBD performance if we snapshot the LVM device underlying the DRBD
>> Primary. If we snapshot the Secondary and use protocol B, where we
>> aren't dependent on local write speeds, my working theory was that the
>> performance hit wouldn't be as noticeable, and the customer seemed to
>> concur (previously we were using C).
>
> That's a fair point, but realistically, how long does it take you to
> take the backup off your snapshot?

10-60 minutes per system - long enough that the I/O-sensitive VMs notice.
And the customer has customers who are up 24 hours a day, so there is no
reliable "quiet time" when we can reduce their I/O bandwidth and not have
it commented on.

> And does this normally coincide
> with the DRBD device getting hammered, which is pretty much the only
> situation in which a downstream client would likely feel any
> disruption?

The DRBDs don't really get hammered at any one time - the backups happen
directly from LVs on the host, and go over the main (not replication)
interface. So the host system's I/O is stressed, sure.
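For the archive, the relevant parts of the resource config look roughly
like this - resource, host and device names here are illustrative, not
our exact configuration:

    resource vm1 {
      protocol B;                          # previously C, see above
      on host-a {
        device    /dev/drbd0;
        disk      /dev/vg0/vm1_disk;       # LV holds exactly what the VM sees
        meta-disk /dev/vg0/vm1_meta[0];    # external metadata, kept off the image
        address   10.0.0.1:7788;
      }
      on host-b {
        # mirror of the above with its own disk, meta LV and address
      }
    }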
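And the backup flow on the Secondary is roughly the following (LV names,
sizes and the backup host are illustrative; with internal metadata you
would add a kpartx -av step on the snapshot to expose the partitions, as
Florian suggests):

    # snapshot the LV underneath the DRBD Secondary, not the Primary
    lvcreate --snapshot --size 10G --name vm1_backup /dev/vg0/vm1_disk

    # copy the image off over the main (non-replication) interface
    dd if=/dev/vg0/vm1_backup bs=4M | ssh backuphost 'cat > /backup/vm1.img'

    # drop the snapshot so it stops costing us write performance
    lvremove -f /dev/vg0/vm1_backup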
Previously the disconnects happened several times a day, not just when
the backups ran - that is a separate issue from the one I asked about,
while still being relevant to the list. Arguably a customer running a
heavily interactive system to very remote destinations shouldn't be
using such a complex I/O stack and should be on dedicated hardware - but
that's a pragmatic, expensive, unambitious argument :-) drbd+LVM has
worked very well for them for 18 months, and the peace of mind of being
able to start their customers' VMs in one of two places makes diagnosing
this properly worth the effort.

-- 
Matthew