[DRBD-user] Default drbdmanage system-kill behavior

Mariusz Mazur mmazur at axeos.com
Thu Nov 2 16:52:11 CET 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


2017-10-30 11:53 GMT+01:00 Robert Altnoeder <robert.altnoeder at linbit.com>:

> Drbdmanage just does not compensate for LVM's shortcomings.
> (…)
> This is also the reason that the codebase of the product that will
> replace drbdmanage in the future is already multiple times the size of
> the current drbdmanage, although it is still in an early stage of its
> development and just barely starting to do anything useful. One cause
> for this increase in size is that even in the experimental version that
> we have right now, the 55 lines of code that attempt LVM volume creation
> are backed by about 2000 lines of error detection, error correction and
> error reporting code.

To be honest, the fact that you're writing a system that will try to
deal with LVM on its own sounds very encouraging to me.

> We could try to provide a product that deals with as many of the
> potential problems as anyone can think of, but since someone obviously
> has to do all the work, the question is: How much would you be willing
> to pay for it?
>
> (…)
>
>> (Btw: are there any other 5.4.1s a new user should be aware of?)
> Thousands probably, depending on the exact configuration.
>
> - Thin provisioning might lock up your system if you run out of space.
> - DRBD meta data may need to clean up slots if you have used them all
> and then replace a node with another node that has a different node ID.
> - LVM may become extremely slow if it is not configured to ignore DRBD
> devices and there are lots of DRBD devices that cannot be opened, e.g.
> because there is another Primary.
> - etc. ...
>
> Apparently, most people have not ever hit the problem you describe. I
> did not ever see it come up in my test environment. Some others have hit
> other problems that you did not encounter.
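
For anyone finding this in the archives: as far as I understand it, the
thin-pool and "ignore DRBD devices" items above come down to a handful
of lines in /etc/lvm/lvm.conf, roughly like the sketch below. The
filter regex and the thresholds are only an illustration of the idea,
not a LINBIT recommendation, so check them against your own setup:

    devices {
        # keep LVM from scanning and trying to open DRBD devices
        global_filter = [ "r|^/dev/drbd.*|" ]
    }

    activation {
        # let dmeventd grow a thin pool before it runs completely full
        thin_pool_autoextend_threshold = 80
        thin_pool_autoextend_percent = 20
    }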

I understand that writing tested software takes time and money. Having
a not-so-complicated version as the first attempt is the reasonable
choice. However, I don't think I can explain to you how frustrating it
was to find out that something I'd spent so much time dealing with was
a well-known issue, just not a properly documented one.

The way I see it is this: drbdmanage running lvcreate will lock up a
system under known conditions. They aren't abnormal conditions. Simply
having the bad luck of running lvcreate against disk space that was
previously utilized (not an uncommon occurrence on a sufficiently
large/old system) will kill said system. I was lucky enough to hit the
problem on 'drbdmanage init'. There will be people less lucky than me
who get this after two years in production.

Avoiding this low-probability game of Russian roulette is very easy,
as long as an admin is made aware of it.
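
And to show how cheap the awareness part is: the checklist item could
literally be a one-liner to run before 'drbdmanage init', something
along these lines (the device path is just an example, and I'm not
claiming this exact check covers every variant of the problem):

    # list any leftover filesystem/RAID/LVM signatures on the disk you
    # are about to hand over to drbdmanage, without modifying anything
    wipefs /dev/sdb

It doesn't fix anything by itself, but it at least tells the admin that
previously used disk space is involved.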

May I suggest putting all this stuff under a single section of the
documentation and then:
- mentioning really early in the "5. Common administrative tasks -
DRBD Manage" chapter that if you don't read that section and configure
your system accordingly, you're asking for trouble
- putting a link at the top of "8. Troubleshooting and error recovery"
- having 'drbdmanage init' mention it. It's quite chatty already, so
there's no reason not to have it nag the admin to go through a system
configuration checklist.

Not a single line of code needs to be written (well, almost) and a lot
of admin hours might get saved. Hell, I'll write it myself if the
documentation sources are public somewhere and you confirm that's an
acceptable change for you.


