[DRBD-user] Cannot synchronize stacked device to backup server with DRBD9

Tue Jun 19 11:34:16 CEST 2018

On Tue, Jun 19, 2018 at 09:19:04AM +0200, Artur Kaszuba wrote:
> Hi Lars, thanks for the answer
> 
> On 18.06.2018 at 17:10, Lars Ellenberg wrote:
> > On Wed, Jun 13, 2018 at 01:03:53PM +0200, Artur Kaszuba wrote:
> > > I know about the 3-node solution and I used it for some time (from ~9.0.8),
> > > but I had stability problems and decided to switch to a stacked
> > > configuration, hoping it would be more stable. As a last resort I will
> > > downgrade to drbd8, where I never had any stability problems, but I would
> > > like to stay with 9 and after some time switch back to a 3-node config.
> > > 
> > > By stability I mean this situation:
> > > - latest version of drbd9 (9.0.14)
> > > - kernel 4.13.0-43-generic on Ubuntu 16.04
> > > - high disk usage/IO on drbd devices
> > > - 3 node configuration
> > > - random system crash on the "drbdadm disconnect/connect" command
> > > When I disable one node everything works without problems and
> > > disconnect/connect works perfectly. Before 9.0.14 I did not have such
> > > crashes, but had others, which are fixed now.
> > 
> > And you cannot be bothered to report "such crashes"
> > in a way that makes it possible to understand and fix those?
> > 
> > "random system crash" is not good enough :-/
> > 
> 
> Yes, I know it is not enough to find the reason for these crashes, and that is
> why I didn't report this separately; I only asked why the stacked solution
> does not work in my case :).
> 
> Sorry, but I cannot write much more; this is happening in a production
> environment and I cannot run tests there.
> I can add:
> - simple tests meant to reproduce this situation, run without high disk
> usage, do not trigger crashes
> - the problems started after the upgrade from drbd 9.0.12 to 9.0.14 and
> drbd-utils 9.3.0-1ppa1~xenial1 to 9.4.0-1ppa1~xenial1; before this we did not
> have such crashes
> - we have ~15 drbd resources in this environment, with high IO in a random
> pattern (databases, indexers, git, file servers, kvm etc)

Can you be more specific, what exactly is "crash"?
Any "final words" from the kernel?
You should capture kernel messages somewhere,
even more so on a prod environment.

We have (test) environments with several thousand resources,
which obviously produce heavy load; we have customers with prod
environments with 1000+ resources, and "heavy load"...
yes, they sometimes have problems, which we then help to solve.
But nothing that would even remotely deserve the label "crash",
not for a long time, anyways.
So it is not at all "obvious" what your crashes may be.

> > > Unfortunately I cannot wait for the next fix,
> > > I need a stable environment.
> > 
> > "I want it all, and I want it now" :-)
> > 
> > For the benefit of those that can afford to wait for the next fix,
> > maybe you should still report the crashes in a way that we can work with.
> > 
> 
> Sorry if I worded it the wrong way; English is not my native language and I
> did not want to sound rude.
> I only wrote about this situation:
> - the system worked without crashes for months
> - the system is the core production environment in the company
> - a drbd upgrade caused random crashes (3 node configuration for drbd9)
> - we cannot manage/create drbd resources because the system could crash on any
> drbdadm connect/disconnect command (which already happened in the middle of
> the day when we were trying to reconnect the backup server :/)
> 
> This situation does not allow me to wait for the next fix; I need to find
> another solution/workaround.

If DRBD 9 does not "behave" for your environment,
what makes you think DRBD 9 in "stacked" would behave any better,
for your situation?

> > > I prefer to use a stacked configuration, even though it is deprecated in
> > > DRBD9.  I decided to write this post because the stacked configuration is
> > > still described in the documentation and should work? Unfortunately, for
> > > now it is not possible to create such a configuration, or I missed
> > > something :/
> > 
> > I know there are DRBD 9 users using "stacked" configurations out there.
> > 
> 
> Hmm, maybe they created their resources some time ago and drbd works for
> already created resources. What I found is a problem with the initial
> synchronization to the backup server:
> - the source server pair is up and one is primary
> - the backup server tries to synchronize data (first time)
> - the primary server tries to enter the Source state for the stacked device;
> at this moment it ends with an error:
> 
> [1636671.252028] drbd system-test-U/0 drbd113 z1: helper command: /sbin/drbdadm before-resync-source
> [1636671.255933] drbd system-test-U/0 drbd113: before-resync-source handler returned 1, dropping connection.
> [1636671.255942] drbd system-test-U z1: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown )
> 
> - the same error (error code) occurs when I execute drbdadm before-resync-source directly:
> 'system-test-U' is a stacked resource, and not available in normal mode.

That should have been fixed *a long time ago*,
  2017-07-18 Nick Wang
  [PATCH] drbdadm: Fix handler called from kernel always invalid for stacking resource

Apparently was never merged :-(

Apologies. We don't use or test "stacked" 9,
because it does not make much sense in a DRBD 9 environment,
we actually planned to patch it out completely.

Those that "successfully use" stacked drbd 9 apparently "silently"
patched their utils (?) or use a wrapper as drbd "usermode_helper".

Fix pushed now:
https://github.com/LINBIT/drbd-utils/commit/60ec9fa
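
For setups that cannot pick up the patched drbd-utils right away, the
wrapper idea mentioned above might look roughly like the sketch below.
It is only an illustration, not something shipped with drbd-utils and
not tested against a real DRBD setup. Assumptions: the wrapper is
installed as the drbd module's usermode_helper, the kernel passes the
resource name as the last argument, and STACKED_RESOURCES is a
hypothetical, hand-maintained list of the stacked resource names. The
only drbdadm feature it relies on is the existing --stacked (-S) option.

```python
#!/usr/bin/env python3
"""Hypothetical usermode_helper wrapper; not part of drbd-utils."""
import os
import sys

DRBDADM = "/sbin/drbdadm"  # helper path as it appears in the kernel log above

# Assumption: names of stacked resources, maintained by hand.
STACKED_RESOURCES = {"system-test-U"}


def build_argv(args, stacked=STACKED_RESOURCES):
    """Return the drbdadm argv for a helper call, adding --stacked if needed."""
    # Assumption: the resource name is the last argument of the helper call.
    if args and args[-1] in stacked:
        return [DRBDADM, "--stacked", *args]
    return [DRBDADM, *args]


if __name__ == "__main__" and len(sys.argv) > 1:
    argv = build_argv(sys.argv[1:])
    # Replace this process so that drbdadm's exit code reaches the kernel unchanged.
    os.execv(argv[0], argv)
```

Handlers for resources not listed in STACKED_RESOURCES are passed
through to drbdadm unmodified, so non-stacked resources should behave
as before.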

> > Maybe you forgot to upgrade your drbd-utils?
> > Current drbd-utils version would be 9.4.0

> If someone could help me understand this situation I would be really
> grateful.

If DRBD 9 "misbehaves" for you,
and you prefer "stacked" anyways, go with 8.4.

Still, if you can, please try to capture some
"last words" for your crashes.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed