[DRBD-user] BUG: Uncatchable DRBD out-of-sync issue

Tue Mar 11 07:39:14 CET 2014

Hello Lars,

>   Upper layer submits write to DRBD.
>   DRBD calculates checksum over data buffer.
>   DRBD sends that checksum.
>   DRBD submits data buffer to "local" backend block device.
>       Meanwhile, upper layer changes data buffer.
>   DRBD sends data buffer to peer.
>   DRBD receives local completion.
>   DRBD receives remote ACK.
>   DRBD completes this write to upper layer.
>       *only now* would the upper layer be "allowed"
>       to change that data buffer again.
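
The sequence quoted above can be modelled in a few lines of Python. This is only a toy sketch, not DRBD code: the function names are hypothetical, CRC32 stands in for whatever integrity algorithm is configured, and the local backend is assumed to pick up the buffer contents before the upper layer modifies them.

```python
import zlib

def replicate_without_copy(buf, mutate):
    """Toy model of the quoted sequence (hypothetical names, not DRBD code)."""
    checksum = zlib.crc32(buf)     # checksum calculated over the data buffer
    local_disk = bytes(buf)        # assumption: local backend sees the contents as of now
    mutate(buf)                    # upper layer changes the buffer while the write is in flight
    sent_to_peer = bytes(buf)      # peer receives the already modified contents
    return checksum, local_disk, sent_to_peer

def flip_first_byte(buf):
    # Stand-in for a guest that reuses the buffer too early.
    buf[0] ^= 0xFF

checksum, local_disk, peer_disk = replicate_without_copy(
    bytearray(b"A" * 4096), flip_first_byte)

checksum_matches = zlib.crc32(peer_disk) == checksum
out_of_sync = local_disk != peer_disk
print("checksum matches on peer:", checksum_matches)
print("local and peer differ:", out_of_sync)
```

In this model the checksum no longer matches what the peer received, and the two copies of the block differ, which is the out-of-sync situation being discussed.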

I think you were right: the upper layer misbehaves. I've turned write
caching off for the Linux KVM guests, and the last check found only one OOS
(it was probably caused before I turned caching off, so I'll wait one more
week). Thank you for pointing me in the right direction to dig.

So far I see the following ways to avoid OOS:
1. Disabling write caching.
2. Using barriers in the guest OSes. They are enabled by default for ext4
and can be enabled for ext3, but:
- they can't be enabled for swap
- I'm not sure what to do with Windows guests (NTFS is assumed to support
barriers, but I've seen OOS on Windows partitions several times; maybe I
need to disable write caching inside Windows)

The first way can cause slowdowns. The second is too difficult,
especially when you can't control the guest OSes.

After all this, I wonder why DRBD can't copy the buffer before writing and
then submit/send that copy rather than the original (which can be changed
at any time)?
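
To make the question concrete, here is a minimal Python sketch of the "copy first" idea. It is a toy model with hypothetical names, not a description of how DRBD works or could be patched; CRC32 again stands in for the configured integrity algorithm. The point is only that the checksum, the local write and the network send all use one private snapshot, so a later change to the original buffer cannot make them diverge.

```python
import zlib

def replicate_with_copy(buf, mutate):
    """Toy model: snapshot the buffer once, then use only the snapshot."""
    snapshot = bytes(buf)              # private copy, costs one extra memcpy per write
    checksum = zlib.crc32(snapshot)    # checksum over the snapshot
    local_disk = snapshot              # local backend writes the snapshot
    mutate(buf)                        # upper layer may change the original at any time
    sent_to_peer = snapshot            # peer receives the same snapshot
    return checksum, local_disk, sent_to_peer

def flip_first_byte(buf):
    # Stand-in for a guest that reuses the buffer too early.
    buf[0] ^= 0xFF

original = bytearray(b"A" * 4096)
checksum, local_disk, peer_disk = replicate_with_copy(original, flip_first_byte)

in_sync = local_disk == peer_disk
checksum_matches = zlib.crc32(peer_disk) == checksum
print("local and peer identical:", in_sync)
print("checksum matches on peer:", checksum_matches)
```

The obvious price is the additional copy of every write, which is presumably the trade-off against zero-copy I/O.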

Best regards,
Stanislav