Hi Christoph,

I believe that, at least for synchronous replication with protocol C, the oos count should always be 0 in a healthy, fully synchronized configuration, and that any occurrence of a value > 0 (except while manual administrative tasks are running) indicates a problem that needs to be investigated. I therefore regard an automated disconnect-connect, done solely to clear the oos counter without determining the cause, as both a very bad idea and bad practice.

We have run hundreds of synchronously replicated DRBD8 volumes for years now and verify them weekly, and we have never seen oos that was not caused by a runtime, configuration or hardware issue.

Our verification runs use a script similar to yours, but it actively parallelises the task to minimise total duration while maintaining a constant load that won't harm performance. It does so by sorting all volumes by size and then running a given number of verify tasks at once, beginning with the largest volumes and starting the next verify as soon as one finishes. Especially on machines that have a few very big volumes and lots of small ones, this lets the verification of all volumes complete in the time the big volumes alone would take, giving minimal duration at constant I/O load without peaks.

The script prints a report to stdout and writes any occurrence of oos to stderr, making it easy to filter for problems -- even before monitoring notices.

Best regards,
// Veit

-------- Original Message --------
From: Christoph Lechleitner <christoph.lechleitner at iteg.at>
Sent: 28 December 2017 01:05:30 CET
To: drbd-user <drbd-user at lists.linbit.com>
CC: Wolfgang Glas <wolfgang.glas at iteg.at>
Subject: [DRBD-user] Semantics of oos value, verification abortion

Hello everybody!

I have a question regarding the exact semantics of the oos value in /proc/drbd.

The User's Guide https://docs.linbit.com/doc/users-guide-84/ch-admin/ says:

"oos (out of sync). Amount of storage currently out of sync; in Kibibytes. Since 8.2.6."

After several unsettling events over the years we have now started to do regular verify runs. We will announce our script as open source right here at some point in the future, but we want to clarify some details first.

Our script basically calls drbdadm verify on one resource at a time, because drbdadm verify all would kill the system for sure. After the verification run has completed, the script
- analyses the oos: value,
- possibly disconnects & connects the resource,
- starts verification of the next resource.

The script does not run as a daemon; it is simply called regularly via cron, on the node with the more important resources.

My main question is: Should the oos value always be 0? Does a non-0 value of oos mean that there have been sync errors? Or does oos also include blocks that are currently being synced or waiting to be synced? In the latter case, what would be a valid condition to disconnect & connect a resource after a verification run?

Also: Are there events that can cause a verification run to be aborted? One verification run on a huge resource (1.3 TB, HW RAID 5, dedicated GBit line) finished way too fast, so I think something must have aborted it, like, say,
- a buffer runs full -> automatic disconnect/reconnect -> verification aborted

If something along these lines is possible, is there a way to avoid or detect that? Maybe a kernel message we could grep for?

Thanks,
Regards,
Christoph

--
Christoph Lechleitner
Management
ITEG IT-Engineers GmbH | Conradstr. 5, A-6020 Innsbruck
Mail: christoph.lechleitner at iteg.at | Web: http://www.iteg.at/
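
For illustration, the scheduling idea described in the reply above (sort all volumes by size, run a fixed number of verifies at once starting with the largest, start the next one as soon as a slot frees up, report to stdout and flag oos on stderr) might be sketched roughly as follows. This is not the script from the thread, only a minimal sketch under assumptions: the names verify_largest_first and report are hypothetical, and the verify callback is assumed to block until the verify of one resource has finished and to return that resource's oos value in KiB. How the callback does that (for example by starting drbdadm verify and watching /proc/drbd) is deliberately left out.

```python
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def verify_largest_first(
    sizes: Dict[str, int],
    verify: Callable[[str], int],
    max_parallel: int = 2,
) -> List[Tuple[str, int]]:
    """Run verify(name) for every volume, largest first.

    sizes maps a resource name to its size (any consistent unit).
    verify is a caller-supplied (hypothetical) function that blocks until
    the verify of that resource is done and returns its oos value in KiB.
    At most max_parallel verifies run at the same time.
    """
    # Largest volumes first, so the long-running ones start immediately.
    order = sorted(sizes, key=lambda name: sizes[name], reverse=True)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Tasks are queued in this order; whenever a worker becomes free
        # it picks up the next (largest remaining) volume.
        oos_values = list(pool.map(verify, order))
    return list(zip(order, oos_values))


def report(results, out=sys.stdout, err=sys.stderr) -> None:
    """Print one line per volume to out, and non-zero oos to err."""
    for name, oos in results:
        print(f"{name}: verify finished, oos={oos}", file=out)
        if oos > 0:
            print(f"{name}: oos={oos} KiB, needs investigation", file=err)
```

With max_parallel set to the number of verifies the storage can sustain without hurting performance, the small volumes fill the free slots while the big ones are still running. Filtering for problems then only requires looking at stderr.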