[DRBD-user] LCMC display (and other tools) says "up to date" but DRBD is not

Thu Oct 25 11:33:17 CEST 2012

On Wed, Oct 24, 2012 at 05:44:49PM -0400, Whit Blauvelt wrote:
> 
> Date: Wed, 24 Oct 2012 17:16:21 -0400
> From: Whit Blauvelt <whit.drbd at transpect.com>
> To: Rasto Levrinc <rasto.levrinc at gmail.com>, drbd-user at lists.linbit.com
> Cc: drbd-mc at lists.linbit.com
> Subject: Re: [drbd-mc] LCMC display says "up to date" but DRBD is not
> User-Agent: Mutt/1.5.21 (2010-09-15)
> 
> I wrote:
> 
> > > I've got a fairly simple setup, that back some time ago was working well,
> > > but at some point has slipped away from me. I have a number of KVM VMs which
> > > have been set up by using a distinct LVM partition behind each, and then
> > > using DRBD to mirror these between two servers via a dedicated crossover.
> > > There are 6-8 VMs on each of the two servers, with both dedicated LVMs and
> > > dedicated DRBD resources. I've been using current versions of LCMC along the
> > > way to set up the DRBD mirroring. The LVMs have been set up using native
> > > tools, and the KVM VMs through libvirt.
> > > 
> > > To put the problem briefly, I've recently discovered, on shutting down VMs
> > > on one server and then restarting the VMs on the other, after shifting DRBD
> > > primary assignments, that the secondary DRBD storage has not kept up. This
> > > is despite Connected/UpToDate claims in the Storage display of LCMC.
> 
> 
> > The display in LCMC should be ok. Your problem is probably either your
> > config or an administration error at some point, forcing the DRBD to think
> > the data are up-to-date. You can run online verify to check if your
> > secondary has the same data as primary, before finding out the hard way. For
> > DRBD specific questions, you should ask in drbd-user mailing list.
> > 
> > Rasto
> 
> Thanks Rasto,
> 
> Including the drbd list now. 
> 
> I'm certainly capable of administrative error.

One such "administrative error" that we've come across much too frequently,
and which shows exactly these "symptoms", is this:

(I'm in the ascii art mood today...)

You at one point had:
---------------------

 VM
  \
   \
 [logical volume]

Then you added DRBD, and now you have:
--------------------------------------

 VM         [DRBD] -------- [DRBD] remote node
  \          /
   \        /             !! THIS IS WRONG !!
 [logical volume]

 (DRBD does not see or know about any changes made by the VM,
  because the VM writes to the logical volume directly)

But what you need is actually:
==============================

              VM 
              |
            [DRBD] -------- [DRBD] remote node
            /
 [logical volume]

 (DRBD sees every change done by the VM, and thus
  has a chance to mirror the changes over).
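
A quick way to check which of the two pictures you actually have is to look
at the block device each VM is configured to use. Below is a minimal sketch
in Python, assuming the VMs are defined through libvirt and that the domain
XML (as printed by "virsh dumpxml") is available as a string. The sample
XML, the volume group name "vg0" and the function name are made up for
illustration; only the "/dev/drbd" device name prefix is taken as given.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML, shaped like the output of "virsh dumpxml <domain>".
# The volume group name "vg0" is an assumption for illustration only.
DOMAIN_XML = """
<domain type='kvm'>
  <name>cent_s</name>
  <devices>
    <disk type='block' device='disk'>
      <source dev='/dev/vg0/cent_s'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def disks_bypassing_drbd(domain_xml):
    """Return block-device sources that are not /dev/drbd* devices."""
    root = ET.fromstring(domain_xml)
    bypassing = []
    for source in root.findall("./devices/disk/source"):
        dev = source.get("dev")  # set for block-backed disks only
        if dev is not None and not dev.startswith("/dev/drbd"):
            bypassing.append(dev)
    return bypassing


# A non-empty list means the VM writes underneath DRBD (the WRONG picture).
print(disks_bypassing_drbd(DOMAIN_XML))
```

If the list is not empty, the VM is sitting directly on the logical volume,
and its disk source needs to point at the DRBD device instead. File-backed
disks carry no "dev" attribute and are skipped by this sketch.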

Cheers,
        Lars

> And the reporting of UpToDate
> when the filesystems are definitely not is deeper than LCMC - drbd-overview
> shows the same thing, "Connected UpToDate/UpToDate" even though the mirror
> doesn't match. "cat /proc/drbd" gives the same misinformation. "drbdadm
> cstate xxx" also gives "Connected". And "drbdadm dstate cent_s" gives
> "UpToDate/UpToDate" on both servers.
> 
> A problem with the "administrative error" hypothesis is that the DRBD
> administration has been, beyond the initial installation, entirely through
> LCMC. That is, it's a problem for LCMC (perhaps an older version though) if
> it allows an admin's error that results in false reports of up-to-date
> connections.
> 
> Using online verify also confirms that we're not at all up to date:
> 
> Oct 24 16:49:24 vm1 kernel: [5730169.131424] block drbd0: conn( Connected -> VerifyS ) 
> Oct 24 16:49:24 vm1 kernel: [5730169.131434] block drbd0: Starting Online Verify from sector 0
> Oct 24 16:49:24 vm1 kernel: [5730169.185546] block drbd0: Out of sync: start=584, size=8 (sectors)
> Oct 24 16:49:24 vm1 kernel: [5730169.188980] block drbd0: Out of sync: start=1112, size=16 (sectors)
> Oct 24 16:49:24 vm1 kernel: [5730169.236967] block drbd0: Out of sync: start=64, size=8 (sectors)
> Oct 24 16:49:24 vm1 kernel: [5730169.630823] block drbd0: Out of sync: start=32832, size=8 (sectors)
> ... on for 947 lines of such notices in this case.
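
To get a feel for how much data such a verify run flagged, the "Out of sync"
notices can be added up. The following is a rough sketch in Python that only
relies on the line format shown in the log excerpt above. The function name
is made up for illustration, and the 512-byte sector size used for the byte
count is an assumption.

```python
import re

# Matches the kernel notices quoted above, e.g.
# "block drbd0: Out of sync: start=584, size=8 (sectors)"
OUT_OF_SYNC = re.compile(r"Out of sync: start=(\d+), size=(\d+) \(sectors\)")

SECTOR_BYTES = 512  # assumption: sizes are reported in 512-byte sectors

SAMPLE_LOG = [
    "Oct 24 16:49:24 vm1 kernel: [5730169.131434] block drbd0: Starting Online Verify from sector 0",
    "Oct 24 16:49:24 vm1 kernel: [5730169.185546] block drbd0: Out of sync: start=584, size=8 (sectors)",
    "Oct 24 16:49:24 vm1 kernel: [5730169.188980] block drbd0: Out of sync: start=1112, size=16 (sectors)",
    "Oct 24 16:49:24 vm1 kernel: [5730169.236967] block drbd0: Out of sync: start=64, size=8 (sectors)",
    "Oct 24 16:49:24 vm1 kernel: [5730169.630823] block drbd0: Out of sync: start=32832, size=8 (sectors)",
]


def out_of_sync_summary(log_lines):
    """Return (number of ranges, total sectors, total bytes) flagged."""
    ranges = 0
    sectors = 0
    for line in log_lines:
        match = OUT_OF_SYNC.search(line)
        if match:
            ranges += 1
            sectors += int(match.group(2))
    return ranges, sectors, sectors * SECTOR_BYTES


print(out_of_sync_summary(SAMPLE_LOG))
```

Fed the complete kernel log instead of the four sample notices, the same
function gives the totals for the whole verify run.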
> 
> Disconnecting and reconnecting the secondary should cause a resync per the
> manual. Okay. But that's not preventing the problem redeveloping - not
> identifying and correcting the cause.
> 
> To review how these were administratively set up: An LVM partition was used
> as a backing store in creating each VM. A matching LVM partition was created
> on the second server. LCMC was used at that point to assign both to DRBD,
> using the data from the first LVM.
> 
> It is initially working, or else the secondary wouldn't be populated at all.
> But it stops working at some point, while leaving DRBD showing that
> everything's just fine - short of running online verify or doing the
> disconnect-reconnect sequence. I could script disconnect-reconnect behavior
> overnight. That still wouldn't guarantee good mirrors in between, so DRBD
> still can't be 100% depended on for failover then.
> 
> This is not the most up-to-date system, drbd version 8.3.8.1. Still....
> 
> Whit

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed