[DRBD-user] Semantics of oos value, verification abortion

Thu Dec 28 04:25:13 CET 2017

Hi Christoph, 

I believe that, at least for synchronous replication with protocol C, the oos count should always be 0 in a healthy, fully synchronized configuration, and that any occurrence of a value >0 (except while manual administrative tasks are running) indicates a problem that needs to be investigated. I therefore regard an automated disconnect-connect, done solely to clear the oos counter without determining the cause, as both a very bad idea and bad practice.
We have been running hundreds of synchronously replicated DRBD8 volumes for years now and verify them weekly, and we have never seen oos that was not caused by a runtime, configuration, or hardware issue.

Our verification runs use a script similar to yours, but it actively parallelises the task to minimise total duration while maintaining a constant load that won't harm performance. It does so by sorting all volumes by size and then running a given number of verify tasks at once, beginning with the largest volumes and starting the next verify as soon as one finishes. Especially on machines that have a few very big volumes and lots of small ones, this allows the verification of all volumes to complete in the time the big volumes alone would take, giving minimal duration at constant I/O load without peaks. The script prints a report to stdout and any occurrence of oos to stderr, making it easy to filter for problems, even before monitoring notices them.
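
The scheduling described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the actual script: verify_all and verify_and_wait are hypothetical names, and verify_and_wait stands for whatever starts a verify on one volume and blocks until that verify has finished (how to detect completion is deliberately left to that function).

```python
from concurrent.futures import ThreadPoolExecutor


def verify_all(volumes, verify_and_wait, max_parallel=2):
    """Verify all volumes, largest first, at most max_parallel at a time.

    volumes:          dict mapping volume name -> size (any comparable unit)
    verify_and_wait:  hypothetical callable; must block until the verify
                      of the named volume has completed, and may return
                      a result such as the oos count found
    Returns a dict mapping volume name -> result of verify_and_wait.
    """
    # Largest volumes first, so they start as early as possible.
    ordered = sorted(volumes, key=volumes.get, reverse=True)

    # The pool queues tasks in submission order; a worker that finishes
    # picks up the next queued volume, keeping the number of running
    # verifies constant until the queue is empty.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {name: pool.submit(verify_and_wait, name) for name in ordered}
        return {name: future.result() for name, future in futures.items()}
```

With max_parallel set to the number of concurrent verifies the storage can sustain, the small volumes fill the slots left free while the big ones are still running.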

Best regards, 
// Veit 

-------- Original Message --------
From: Christoph Lechleitner <christoph.lechleitner at iteg.at>
Sent: 28 December 2017 01:05:30 CET
To: drbd-user <drbd-user at lists.linbit.com>
CC: Wolfgang Glas <wolfgang.glas at iteg.at>
Subject: [DRBD-user] Semantics of oos value, verification abortion

Hello everybody!

I have a question regarding the exact semantics of the oos value in
/proc/drbd.

The Users Guide
  https://docs.linbit.com/doc/users-guide-84/ch-admin/
says:
  "oos (out of sync). Amount of storage currently out of sync; in
Kibibytes. Since 8.2.6."

After several unsettling events over the years we have now started to
do regular verify runs.

We will announce our script as open source right here at some point in
the future, but we want to clarify some details first.

Our script basically calls
  drbdadm verify
on one resource at a time, because
  drbdadm verify all
would kill the system for sure.

After the verification run has completed, the script
- analyses the oos: value,
- if necessary, disconnects & connects the resource,
- starts verification of the next resource
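
The "analyses the oos: value" step could look roughly like the following Python sketch. It is only an illustration, not the script discussed in this thread: parse_oos and minors_out_of_sync are hypothetical helper names, and SAMPLE is a hand-written example in the layout /proc/drbd uses on DRBD 8.4, not output captured from a real system.

```python
import re

# A device block starts with a line such as " 0: cs:Connected ro:...".
MINOR_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")
# The oos counter appears as "oos:<number>" on a following line.
OOS_RE = re.compile(r"\boos:(\d+)")

# Hand-written sample for illustration only (assumed 8.4-style layout).
SAMPLE = """\
version: 8.4.10 (api:1/proto:86-101)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:128
"""


def parse_oos(proc_drbd_text):
    """Return {minor: (connection_state, oos_kib)} from /proc/drbd text."""
    result = {}
    minor = state = None
    for line in proc_drbd_text.splitlines():
        header = MINOR_RE.match(line)
        if header:
            minor, state = int(header.group(1)), header.group(2)
            continue
        counter = OOS_RE.search(line)
        if counter and minor is not None:
            result[minor] = (state, int(counter.group(1)))
    return result


def minors_out_of_sync(proc_drbd_text):
    """Return the sorted list of minors whose oos counter is non-zero."""
    parsed = parse_oos(proc_drbd_text)
    return sorted(m for m, (_, oos) in parsed.items() if oos > 0)
```

On a real node the text would come from reading /proc/drbd. Keeping the connection state next to the counter lets a caller tell a device that is still verifying or syncing apart from one that is idle.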

The script does not run as a daemon; it is simply called regularly via
cron, on the node with the more important resources.

My main question is:

Should the oos value always be 0?

Does a non-0 value of oos mean that there have been sync errors?

Or does oos include blocks that are currently being synced or waiting
to be synced, too?

In the latter case, what would be a valid condition to disconnect &
connect a resource after a verification run?

Also: Are there events that can cause a verification run to be aborted?

One verification run on a huge resource (1.3 TB, HW RAID 5, dedicated
GBit line) was finished way too fast, so I think something must have
aborted it, like, say,
- a buffer runs full
-> automatic disconnect/reconnect
-> verification aborted

If something along this line is possible, is there a way to avoid or
detect that?
Maybe a kernel message we could grep for?

Thanks,

Regards,

Christoph

-- 

Christoph Lechleitner

Geschäftsführung

------------------------------------------------------------------------
ITEG IT-Engineers GmbH | Conradstr. 5, A-6020 Innsbruck
Mail: christoph.lechleitner at iteg.at | Web: http://www.iteg.at/
------------------------------------------------------------------------

_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user