[DRBD-user] "PingAck not received" messages

Thu May 24 14:55:10 CEST 2012

On Thu, May 24, 2012 at 1:53 PM, Matthew Bloch <matthew at bytemark.co.uk>
wrote:
> Hmm, thanks.  Unrelated to any of this, the v3a kernel (Debian 2.6.32-4)
> crashed pretty badly 48hrs ago.  Since it was rebooted, there have
> been no "PingAck not received" messages.

Sure, if you have kernel-induced network problems on one of your
nodes, that would definitely explain the issues you're seeing. But you
insisted from the start that there were no network issues. :)

> So assuming we get a week free of these messages, I'm guessing there was a
> drbd bug of some kind but the reboot cleared it up.

It might just as well not be a DRBD bug at all, just DRBD trying to do
the right thing in the face of a flaky network stack.

> We are preparing to jump to a 2.6.32 sourced from CentOS because this Debian
> kernel seems to crash with one bug or another every few months.

That would seem like an odd thing to do. FWIW, we've been running
happily on squeeze kernels for months.

> The reason we're using external meta-devices is for backup: without the
> metadata at the end, the underlying disk image represents exactly what the
> VMs see.  We can then snapshot this and take a reasonably consistent backup
> without bothering DRBD.  We later verify this backup by booting it back up,
> disconnected, and taking a snapshot of the VNC console!

You can always do that from a device with internal metadata as well.
kpartx is your friend.
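
To put a number on how little of the backing device internal metadata
occupies, here is a rough Python sketch. It assumes DRBD 8.x internal
metadata and uses the size estimate from the DRBD 8.x User's Guide
(ceil(Cs / 2^18) * 8 + 72, in 512-byte sectors). The function name is
made up for illustration, and the result is an estimate, so check it
against your own DRBD version before relying on it.

```python
SECTOR_BYTES = 512  # the estimate below is expressed in 512-byte sectors


def internal_md_sectors(device_sectors: int) -> int:
    """Estimate the sectors DRBD 8.x internal metadata takes up.

    Hypothetical helper. Assumes the User's Guide estimate
    Ms = ceil(Cs / 2**18) * 8 + 72, with Cs the backing device size
    in sectors.
    """
    if device_sectors <= 0:
        raise ValueError("device_sectors must be positive")
    # -(-a // b) is integer ceiling division, avoiding float rounding.
    bitmap_chunks = -(-device_sectors // 2**18)
    return bitmap_chunks * 8 + 72


if __name__ == "__main__":
    one_gib_sectors = 1024**3 // SECTOR_BYTES
    md = internal_md_sectors(one_gib_sectors)
    print(f"{md} sectors, {md * SECTOR_BYTES // 1024} KiB")
    # prints "136 sectors, 68 KiB"
```

Since internal metadata sits at the end of the backing device, the
data (partition table included) still starts at offset 0 of a
snapshot, which is why a tool like kpartx can map the partitions
directly.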

> The reason I picked protocol B is because LVM snapshots kill the local DRBD
> performance if we snapshot the LVM device underlying the DRBD Primary.  If
> we snapshot the Secondary and used protocol B where we weren't dependent on
> local write speeds, my working theory was that the performance hit wouldn't
> be as noticeable, and the customer seemed to concur (previously we were
> using C).

That's a fair point, but realistically, how long does it take you to
take the backup off your snapshot? And does this normally coincide
with the DRBD device getting hammered, which is pretty much the only
situation in which a downstream client would likely feel any
disruption?
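
For what it's worth, the protocol B versus C reasoning above can be
sketched with a toy latency model in Python. This is a deliberate
simplification, not a description of DRBD internals: it assumes a
write completes under A once the local disk is done, under B once the
peer has also received the data, and under C once the peer has also
written it to disk. The function name and all figures are
hypothetical.

```python
def write_ack_latency_ms(protocol: str, local_disk_ms: float,
                         rtt_ms: float, remote_disk_ms: float) -> float:
    """Toy model of when the Primary reports a write as complete.

    Simplifying assumptions (not DRBD internals):
      A: local disk write finished (send-buffer cost ignored)
      B: local disk write finished and peer acknowledged receipt
      C: local disk write finished and peer acknowledged its disk write
    Local and remote work are assumed to run in parallel.
    """
    protocol = protocol.upper()
    if protocol == "A":
        return local_disk_ms
    if protocol == "B":
        return max(local_disk_ms, rtt_ms)
    if protocol == "C":
        return max(local_disk_ms, rtt_ms + remote_disk_ms)
    raise ValueError(f"unknown protocol: {protocol!r}")


if __name__ == "__main__":
    # Hypothetical figures: a snapshot on the Secondary slows its disk
    # writes from 1 ms to 4 ms.
    for proto in ("B", "C"):
        normal = write_ack_latency_ms(proto, 1.0, 0.5, 1.0)
        snapshotted = write_ack_latency_ms(proto, 1.0, 0.5, 4.0)
        print(proto, normal, snapshotted)
```

In this model a slow Secondary disk only shows up in protocol C, which
matches the theory in the quoted paragraph. Whether clients actually
notice still depends on how busy the device is while the snapshot
exists.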

Just my two cents. Or pence. :)

Cheers,
Florian
