[DRBD-user] Cannot synchronize stacked device to backup server with DRBD9

Artur Kaszuba artur at netmix.pl
Tue Jun 19 13:06:13 CEST 2018


On 19.06.2018 at 11:34, Lars Ellenberg wrote:
>>> And you cannot be bothered to report "such crashes"
>>> in a way that makes it possible to understand and fix those?
>>>
>>> "random system crash" is not good enough :-/
>>>
>>
>> Yep, I know that is not enough to find the cause of these crashes, and that is why
>> I did not report them separately; I only asked why the stacking solution does
>> not work in my case :).
>>
>> Sorry, but I cannot write much more; this is happening on a production
>> environment and I cannot run tests there.
>> I can add:
>> - I have simple tests to reproduce this situation, but without high disk usage
>> they do not cause crashes
>> - problems started after the upgrade from drbd 9.0.12 to 9.0.14 and drbd-utils
>> 9.3.0-1ppa1~xenial1 to 9.4.0-1ppa1~xenial1; before this we did not have such
>> crashes
>> - we have ~15 drbd resources in this environment, with high IO in random
>> patterns (databases, indexers, git, file servers, KVM, etc.)
> 
> Can you be more specific, what exactly is "crash"?
> Any "final words" from the kernel?
> You should capture kernel messages somewhere,
> even more so on a prod environment.
> 
> We have (test) environments with several thousand resources,
> and obviously produce heavy load, we have customers with prod
> environments with 1000+ resources, and "heavy load"...
> yes, they sometimes have problems, which we then help to solve.
> But nothing that would even remotely deserve the label "crash",
> not for a long time, anyways.
> So it is not at all "obvious" what your crashes may be.

I don't have any useful crash messages, but I will describe what I have 
done and what I tested; maybe it could help. Unfortunately I cannot run 
more tests there, at least not until the next crash happens :)

This crash could be a fresh problem; it happened after the upgrade to the 
latest versions, ~3 weeks after 9.0.14 was released. Maybe other users 
have not updated their versions yet, or it is specific to our 
hardware/configuration. We have a multi-resource configuration; we do not 
use multiple volumes in one resource to share the same connection (that is 
planned).

We got it two times:
- the first in the middle of the night, a few hours after the drbd 
upgrade, when our automatic verifications ran with a final 
disconnect/connect (see the sketch below)
- the second in the morning of the same day as the first crash, when I 
tried to execute drbdadm disconnect/connect to reconnect the backup 
server. The crash happened just after the drbdadm connect command.
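
For reference, the nightly verification cycle is essentially the 
following, per resource (a simplified sketch; the resource name is only an 
example):

  # run online verification of the replicated data
  drbdadm verify system-test
  # ...wait for the verify to finish, then disconnect/connect so that any
  # out-of-sync blocks found by the verify get resynchronized
  drbdadm disconnect system-test
  drbdadm connect system-test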

During the first crash there was IO generated by backups/verifications.
The second time it was the morning rush hour, when all systems were fully 
saturated.
The same day, during quiet hours, when IO was low and there were no users, 
I executed many connect/disconnect commands and could not crash the 
system. I have not tried it again under high IO.

At the same time, on a second identical system, but with other services 
and a much lower IO load, we did not have any crashes.

So it could be an isolated problem, unique to the day/IO load/hardware 
etc. But after removing the backup server from the 3-node configuration I 
can connect/disconnect without problems under any IO load. We still use 
9.0.14 (with a 2-node configuration, without the backup server) and we 
have not had any crashes since then. I know this is not hard evidence of 
a drbd problem, but we have no other idea what could be the cause of the 
system crashes.

About "last word" of kernel, there is nothing in logs, it ends with some 
standard messages and start with system booting messages. It looks like 
kernel crash witch prevent syncing blocks/files to disks.
We dont store sol messages remotly, servers does not have kvm buffers 
and we dont have kernel configuration to store panics on remote 
machines, for now i cannot send any kernel panic messaged :(. But it is 
not bad idea to have some solution to catch kernel panics in this 
environment, thank you for suggestion.
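
Something like netconsole would probably be enough to capture the panic 
output on another machine; a minimal sketch (the addresses, interface and 
MAC below are placeholders, not our real setup):

  # send kernel messages over UDP to a log host running a syslog listener
  modprobe netconsole netconsole=6665@10.0.0.11/eth0,514@10.0.0.99/00:11:22:33:44:55
  # make sure an oops becomes a visible panic and the machine reboots afterwards
  sysctl kernel.panic_on_oops=1 kernel.panic=10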


> 
>>>> Unfortunately i cannot wait for next fix,
>>>> i need stable environment.
>>>
>>> "I want it all, and I want it now" :-)
>>>
>>> For the benefit of those that can afford to wait for the next fix,
>>> maybe you should still report the crashes in a way that we can work with.
>>>
>>
>> Sorry if I put it the wrong way; English is not my native language and I
>> did not want to sound rude.
>> I only wrote about this situation:
>> - the system works without crashes for months
>> - the system is the core production environment of the company
>> - the drbd upgrade causes random crashes (3-node configuration with drbd9)
>> - we cannot manage/create drbd resources because the system could crash on any
>> drbdadm connect/disconnect command (which already happened in the middle of the
>> day when we were trying to reconnect the backup server :/)
>>
>> Such a situation does not allow me to wait for the next fix; I need to find
>> another solution/workaround.
> 
> If DRBD 9 does not "behave" for your environment,
> what makes you think DRBD 9 in "stacked" would behave any better,
> for your situation?
> 

I'm not 100% sure it will work correctly; for now I'm looking for some 
solution to fix the situation. My decision is based on the following (a 
rough sketch of the stacked layout follows the list):
- after removing the 3-node configuration I don't have any crashes
- I assume that 2-node configurations could use some older code inside 
drbd and could be more stable than the new solution
- I tried the 3-node configuration for ~6 months and had a number of 
problems which were specific to the multi-node config
- I never had a problem with 2-node configurations
- I found information that SUSE does not support DRBD9 in multi-node 
configurations, only 2 nodes/stacking; maybe they know better what to use :)
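
The stacked layout I am trying to reach is the usual two-node pair with 
the backup server stacked on top, roughly like this (a simplified sketch; 
the resource and stacked device names follow the log quoted further down, 
while the host names, addresses and backing devices are placeholders):

  resource system-test {
    device    /dev/drbd13;              # placeholder lower-level minor
    disk      /dev/vg0/system-test;     # placeholder backing device
    meta-disk internal;
    on node-a { address 10.0.0.1:7713; }
    on node-b { address 10.0.0.2:7713; }
  }

  resource system-test-U {
    stacked-on-top-of system-test {
      device  /dev/drbd113;
      address 10.0.0.100:7813;          # floating IP of the active pair
    }
    on backup {
      device    /dev/drbd113;
      disk      /dev/vg0/system-test;
      meta-disk internal;
      address   10.0.1.1:7813;
    }
  }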


>> Hmm, maybe they created their resources some time ago and drbd works for already
>> created resources. What I found is a problem with the initial synchronization
>> to the backup server:
>> - the source server pair is up and one node is primary
>> - the backup server tries to synchronize data (for the first time)
>> - the primary server tries to enter the Source state for the stacked device; at this
>> moment it ends with an error:
>>
>> [1636671.252028] drbd system-test-U/0 drbd113 z1: helper command: /sbin/drbdadm before-resync-source
>> [1636671.255933] drbd system-test-U/0 drbd113: before-resync-source handler returned 1, dropping connection.
>> [1636671.255942] drbd system-test-U z1: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown )
>>
>> - the same error (error code) happened when I executed drbdadm before-resync-source directly:
>> 'system-test-U' is a stacked resource, and not available in normal mode.
> 
> That should have been fixed *a long time ago*,
>    2017-07-18 Nick Wang
>    [PATCH] drbdadm: Fix handler called from kernel always invalid for stacking resource
> 
> Apparently was never merged :-(
> 
> Apologies. We don't use or test "stacked" 9,
> because it does not make much sense in a DRBD 9 environment,
> we actually planned to patch it out completely.
> 
> Those that "successfully use" stacked drbd 9 apparently "silently"
> patched their utils (?) or use a wrapper as drbd "usermod_helper".
> 
> Fix pushed now:
> https://github.com/LINBIT/drbd-utils/commit/60ec9fa
> 

Great, huge thanks, I will try it as soon as a new deb package is 
released :)
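
Until then, a stopgap along the lines of the "usermode_helper wrapper" 
Lars mentioned might be enough; I imagine something as simple as this 
(an untested sketch, paths are placeholders):

  #!/bin/sh
  # /usr/local/sbin/drbdadm-wrapper (untested sketch)
  # Try the normal invocation first; if drbdadm refuses because the
  # resource is stacked, retry in stacked mode.
  /sbin/drbdadm "$@" && exit 0
  exec /sbin/drbdadm --stacked "$@"

and pointing the drbd module parameter at it, e.g.:

  echo /usr/local/sbin/drbdadm-wrapper > /sys/module/drbd/parameters/usermode_helper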


>>> Maybe you missed to upgrade your drbd-utils?
>>> Current drbd-utils version would be 9.4.0
> 
>> If someone could help me to understand this situation I would be really
>> grateful.
> 
> If DRBD 9 "misbehaves" for you,
> and you prefer "stacked" anyways, go with 8.4.
> 
> Still, if you can, please try to capture some
> "last words" for your crashes.
> 

I still believe in 9.0, but maybe not as much as at the start :)
Switching to 8.4 is the last step I will try; I hope 9.0 and stacking 
will work correctly. With 9.0 it will be much easier to switch back to a 
3-node configuration in the future.

Again, huge thanks for this patch. I would like to help more with these 
crashes, but for now I cannot send anything more. If I get another 
system crash and catch any useful crash data, I will send it.

-- 
Artur Kaszuba


