On Thu, May 24, 2012 at 1:53 PM, Matthew Bloch <matthew at bytemark.co.uk> wrote:
> Hmm, thanks. Unrelated to any of this, the v3a kernel (Debian 2.6.32-4)
> crashed pretty badly 48hrs ago. Since it has been rebooted - there have
> been no "PingAck not received" messages.

Sure, if you have kernel-induced network problems on one of your nodes,
that would definitely explain the issues you're seeing. But you insisted
from the start that there were no network issues. :)

> So assuming we get a week free of these messages, I'm guessing there was a
> drbd bug of some kind but the reboot cleared it up.

It might well not be a DRBD bug at all, just DRBD trying to do the right
thing in the face of a flaky network stack.

> We are preparing to jump to a 2.6.32 sourced from CentOS because this Debian
> kernel seems to crash with one bug or another every few months.

That would seem like an odd thing to do. FWIW, we've been running
happily on squeeze kernels for months.

> The reason we're using external meta-devices is for backup: without the
> metadata at the end, the underlying disk image represents exactly what the
> VMs see. We can then snapshot this and take a reasonably consistent backup
> without bothering DRBD. We later verify this backup by booting it back up,
> disconnected, and taking a snapshot of the VNC console!

You can always do that from a device with metadata as well. kpartx is
your friend.

> The reason I picked protocol B is because LVM snaphots kill the local DRBD
> performance if we snapshot the LVM device underlying the DRBD Primary. If
> we snapshot the Secondary and used protocol B where we weren't dependent on
> local write speeds, my working theory was that the performance hit wouldn't
> be as noticeable, and the customer seemed to concur (previously we were
> using C).

That's a fair point, but realistically, how long does it take you to
take the backup off your snapshot? And does this normally coincide with
the DRBD device getting hammered, which is pretty much the only
situation in which a downstream client would likely feel any disruption?

Just my two cents. Or pence. :)

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now