[DRBD-user] Semantics of oos value, verification abortion

Christoph Lechleitner christoph.lechleitner at iteg.at
Thu Dec 28 11:31:21 CET 2017



On 2017-12-28 04:25, Veit Wahlich wrote:
> Hi Christoph, 
> 
> I believe that, at least for synchronous replication with protocol C,

We use Protocol C, too.


> the oos count should always be 0 in a healthy, fully synchronized configuration, and that any occurrence of a value >0 (except during manual administrative tasks)

You mean something like switching roles?
Not an issue, we only switch roles when we really really have to (about
every 2 years, to upgrade OS and sometimes the hardware).


> indicates a problem that requires investigation. Therefore I regard an automated disconnect/connect, for the sole purpose of clearing the oos counter without determining the cause, as both a very bad idea and bad practice.

But,

https://docs.linbit.com/doc/users-guide-84/s-use-online-verify/#s-online-verify-invoke
clearly recommends verify + disconnect + connect as the solution,
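i.e. (with <resource> as a placeholder for the resource name):

  drbdadm verify <resource>
  # once verification has finished, resync any out-of-sync blocks:
  drbdadm disconnect <resource>
  drbdadm connect <resource>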


However, we see non-zero oos values far too often for a setup that is
supposed to have been stable for years now.

We have set both data-integrity-alg and verify-alg to crc32c, and the
dedicated GBit connections are usually heavily underused, so how can
there be synchronization problems on a daily basis?
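For reference, in drbd.conf terms that is (assuming an otherwise stock
8.4 configuration):

  net {
      verify-alg crc32c;
      data-integrity-alg crc32c;
  }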

My fear is that DRBD still has problems with I/O peak situations.

But I can't imagine a sub-optimal or "wrong" configuration "causing" that.


> We have run hundreds of synchronously replicated DRBD8 volumes for years now that we verify weekly, but we have never seen oos that was not caused by a runtime, configuration or hardware issue.
> 
> Our verification runs utilise a script similar to yours, but it actively parallelises the task to optimise for minimum duration while maintaining a constant load that won't harm performance. It does so by sorting all volumes by size and then running a given number of verify tasks at once, beginning with the largest volumes and starting the next verify once one finishes. Especially on machines that have a few very big volumes and lots of small ones, this allows the verification of all volumes to complete in the time the big volumes alone would take, thus minimal duration at constant I/O load without peaks. The script prints a report to stdout, with any occurrence of oos going to stderr, making it easy to filter for problems -- even before monitoring notices.

Nice approach ;-)

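If I understand it correctly, the core of such a script could look
roughly like the untested Python sketch below; the drbdadm/blockdev
plumbing, the /proc/drbd parsing and all the names are my assumptions
about a stock 8.4 box, certainly not your actual code:

#!/usr/bin/env python3
# Untested sketch: verify all DRBD resources, largest first, keeping a
# fixed number of verifies running in parallel, and report any oos on
# stderr. Assumes a stock DRBD 8.4 setup with /proc/drbd available.
import re
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4   # assumed tunable: concurrent verify runs
POLL_SECONDS = 30  # assumed tunable: /proc/drbd polling interval

def run(*cmd):
    return subprocess.check_output(cmd, text=True).strip()

def device_of(res):
    # drbdadm sh-dev prints the /dev/drbdN device of a resource.
    return run('drbdadm', 'sh-dev', res)

def minor_of(res):
    # Assumes the default /dev/drbdN device naming scheme.
    return device_of(res).rsplit('drbd', 1)[1]

def size_of(res):
    # blockdev --getsize64 reports the device size in bytes.
    return int(run('blockdev', '--getsize64', device_of(res)))

def proc_status(minor):
    # Return the two /proc/drbd lines for this minor: the state line
    # (cs:...) plus the counter line that carries the oos: field.
    with open('/proc/drbd') as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(minor + ':'):
            return line + (lines[i + 1] if i + 1 < len(lines) else '')
    return ''

def verify_one(res):
    minor = minor_of(res)
    run('drbdadm', 'verify', res)  # starts online verify, returns at once
    time.sleep(POLL_SECONDS)
    # cs: stays at VerifyS/VerifyT while the verification is running.
    while 'Verify' in proc_status(minor):
        time.sleep(POLL_SECONDS)
    m = re.search(r'oos:(\d+)', proc_status(minor))
    oos = int(m.group(1)) if m else 0
    print(f'{res}: verify done, oos={oos} KiB')
    if oos:
        print(f'{res}: {oos} KiB out of sync!', file=sys.stderr)

def main():
    # Largest volumes first, so the big ones dominate the schedule and
    # the small ones fill in alongside them.
    resources = sorted(run('drbdadm', 'sh-resources').split(),
                       key=size_of, reverse=True)
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        list(pool.map(verify_one, resources))

if __name__ == '__main__':
    main()

With MAX_PARALLEL tuned to what the I/O subsystem tolerates, that
should give the constant-load behaviour you describe.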

Regards,

Christoph


