[DRBD-user] Reproducible ASSERT( os.conn == C_WF_REPORT_PARAMS )

Wed Jul 17 00:09:39 CEST 2013

On Mon, Jul 15, 2013 at 09:39:02PM +0100, Brian Candler wrote:
> > In that backend script,
> >
> > add a loop before the promote,
> > that checks that the connection state really is "Connected",
> > and the disk state really is "UpToDate".
> >
> > It is probably supposed to already wait in this fashion,
> > but apparently it does not.
> 
> Many thanks for your help! Here is what the code currently does,
> according to my reading of it.
> 
> "* checking disk consistency between source and target"
> 
> This is a hairy code path which could quite possibly be broken, but
> apparently it's checking that it's not degraded; which AFAICT means
> that connection state must be Connected and disk state must be
> UpToDate. Is this the bit which you think isn't being checked
> properly? If so I'll add some more debug in here.
> 
> "* changing into standalone mode"
> 
> - issue "drbdsetup <dev> disconnect"
> - poll /proc/drbd until connection state is StandAlone
> - repeat for second device
> 
> "* changing disks into multi-master mode"
> 
> - issue "drbdsetup <dev> <local-f:h:p> <remote-f:h:p> -A
> discard-zero-changes -B consensus --create-device -m"
> - run "drbdsetup <path> show" repeatedly until it finds that
> local_addr/port and remote_addr/port are the expected values
> - repeat for second device
> 
> [At this point I don't see any checks for disk state]
> 
> "* wait until resync is done"
> 
> - poll /proc/drbd until connection state is one of the following
> 
>   CS_WFREPORTPARAMS = "WFReportParams"

^^ not good enough.

>   CS_CONNECTED = "Connected"

^^ That is the one you want,
   *AND* a local disk state of "UpToDate".

vv should not happen
   (because you already did "wait until resync is done"),
   so if you see those, go back to "wait until resync is done".

>   CS_STARTINGSYNCS = "StartingSyncS"
>   CS_STARTINGSYNCT = "StartingSyncT"
>   CS_WFBITMAPS = "WFBitMapS"
>   CS_WFBITMAPT = "WFBitMapT"
>   CS_WFSYNCUUID = "WFSyncUUID"
>   CS_SYNCSOURCE = "SyncSource"
>   CS_SYNCTARGET = "SyncTarget"
>   CS_PAUSEDSYNCS = "PausedSyncS"
>   CS_PAUSEDSYNCT = "PausedSyncT"
> 
> - repeat for second device

btw "promote" is "switch into Primary", which is "drbdsetup primary",
and which you don't even mention here...

For this purpose and scenario,
you only want to promote once you are Connected UpToDate/UpToDate.
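
To make that check concrete, here is a minimal sketch in Python of
"wait for Connected UpToDate/UpToDate" before the promote. It is not
what ganeti actually does. It assumes the DRBD 8.3 /proc/drbd layout
where each device line looks like "<minor>: cs:... ro:.../... ds:.../...".
The names parse_proc_drbd, ready_to_promote and wait_until_ready are
made up for illustration, as are the timeout and interval defaults.

```python
import re
import time

# Assumed DRBD 8.3 /proc/drbd device line:
#  0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
_DEVICE_RE = re.compile(
    r"^\s*(\d+):\s+cs:(\S+)(?:\s+ro:\S+\s+ds:([^/\s]+)/(\S+))?"
)


def parse_proc_drbd(text):
    """Return {minor: (conn_state, local_disk, peer_disk)}.

    Disk states are None when the line carries no ds: field
    (for example an unconfigured device).
    """
    states = {}
    for line in text.splitlines():
        m = _DEVICE_RE.match(line)
        if m:
            states[int(m.group(1))] = (m.group(2), m.group(3), m.group(4))
    return states


def ready_to_promote(text, minor):
    """True only for Connected with UpToDate/UpToDate on that minor."""
    return parse_proc_drbd(text).get(minor) == (
        "Connected", "UpToDate", "UpToDate")


def wait_until_ready(minor, read_status, timeout=60.0, interval=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll read_status() until the device is ready or timeout expires.

    read_status is any callable returning the /proc/drbd text.
    Returns True when ready, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if ready_to_promote(read_status(), minor):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In real use read_status would be something like
lambda: open("/proc/drbd").read(), and the promote would only be issued
once wait_until_ready returns True. WFReportParams and all the
sync states simply keep the loop waiting.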

> - print progress info and repeat until (I believe) the connection
> state is Connected
> 
> So, does the promotion part (disconnect followed by reconnect in
> multi-master mode) look to be the correct approach for drbd 8.3? If
> so I'll concentrate on the "checking disk consistency" bit.

My understanding is that

 - ganeti wants live migration
 - for whatever reason live migration for some time period needs the block
   devices accessible on both hosts simultaneously
 - so drbd needs to be in dual primary mode for this time
 - drbd has a setting to "allow two primaries"; if that is not set,
   drbd would prevent the second primary as long as we have an
   established replication connection

   Note: just because "allow two primaries" is set
   does not mean both have to be Primary all the time.
   You may well promote (switch to Primary) one,
   then later the other,
   and later again demote (switch to Secondary) the first...

 - ganeti does not want to have "allow-two-primaries" active all the
   time (for whatever reasons)
 - drbd 8.3 does not allow changing this flag while the replication
   link is established
 - ganeti normally has the "allow" flag NOT set,
   so it cannot promote the migration target "on demand".
   Instead, ganeti disconnects "on demand", then reconnects with the
   "allow" flag set
 - but it now has to wait for the resync (which should be short, and may
   even be "empty", but DRBD has to do the resync handshake and the
   state transition dance anyway to figure that out)
 - only now, with the "allow" flag set,
   is the promotion of the migration target possible [***]
 - once the migration is done, the migration source will be demoted
 - then follows the inverse: disconnect, then reconnect with the "allow"
   flag *unset* again.

Now, apparently sometimes the promotion at [***] is timed badly,
and things break.

That's the background as I understand it.

Your options:
a)
  fix the timing of that promotion, so that it really only happens
  after the resync is through. If necessary add an extra loop
  to wait for Connected UpToDate/UpToDate, as I tried to sketch earlier.

b)
  always have the "allow-two-primaries" set, and forget about all this
  disconnect/reconnect with different settings dance.

  Then the whole procedure above folds to

   - promote migration target
   - migrate
   - demote migration source

  no disconnect, reconnect, wait for whatever and again...
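
To put the two procedures side by side, here is a rough Python sketch
of the step order only. Everything in it is hypothetical: "ops" stands
in for whatever object actually issues the drbdsetup calls and runs the
migration, and the method names are invented for this illustration.

```python
def migrate_with_reconnect(ops):
    """Flow as described above: toggle the "allow" flag via reconnect."""
    ops.reconnect(allow_two_primaries=True)
    ops.wait_connected_uptodate()   # the step that must not be skipped
    ops.promote_target()
    ops.migrate()
    ops.demote_source()
    ops.reconnect(allow_two_primaries=False)


def migrate_always_allowed(ops):
    """Option b): the "allow" flag is always set, nothing to toggle."""
    ops.promote_target()
    ops.migrate()
    ops.demote_source()


class RecordingOps:
    """Hypothetical stand-in that only records which steps were called."""

    def __init__(self):
        self.steps = []

    def reconnect(self, allow_two_primaries):
        self.steps.append("reconnect(allow=%s)" % allow_two_primaries)

    def wait_connected_uptodate(self):
        self.steps.append("wait")

    def promote_target(self):
        self.steps.append("promote")

    def migrate(self):
        self.steps.append("migrate")

    def demote_source(self):
        self.steps.append("demote")
```

Running either function against a RecordingOps instance and looking at
its steps list shows the difference: option b) has no reconnect and no
wait, so there is no window in which a badly timed promote can happen.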

The "risk" during normal mode (which is supposedly single primary,
even though dual primary is allowed):
"someone" or "something" may promote "accidentally";
then you'd have dual-primary, and DRBD would not "help" you
prevent this potential concurrent usage of the supposedly
exclusively used block device.

I'd say as ganeti should be the only entity messing with
promotion/demotion of DRBD, it should be able to coordinate
with itself sufficiently to not run dual-primary "accidentally",
so it should be safe to *allow* dual primary, even though you
normally drive single primary (except during live migrations).

Hth,
    Lars

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
