[DRBD-user] protocol C replication - unexpected behaviour

Janusz Jaskiewicz janusz.jaskiewicz at gmail.com
Wed Aug 11 20:18:17 CEST 2021


Looking at
https://lists.linbit.com/pipermail/drbd-user/2017-September/023601.html
for an explanation of how protocol C works, it seems that the scenario I
described above is possible if the data is cached above DRBD before the
secondary node is disconnected.
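
To illustrate what I mean, a rough sketch (the mount point and file name
below are made up, not my actual setup):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical path; assumes the XFS file system on top of the
     * DRBD device is mounted at /mnt/drbd0 on the primary */
    int fd = open("/mnt/drbd0/testfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char buf[] = "some record\n";

    /* write() only puts the data into the page cache above DRBD;
     * nothing has reached the DRBD device (or the peer) yet */
    if (write(fd, buf, strlen(buf)) < 0)
        perror("write");

    /* only here does the data go down to the block layer, where
     * protocol C sends it to the peer and waits for the ack */
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);
    return 0;
}

If the connection to the secondary is cut between the write() and the
fsync(), the record still lands on the primary's disk when the cache is
flushed, but it can no longer be replicated, which would explain the
extra data I saw on the primary.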

My disk-flushes and md-flushes are left at their defaults, so they are both 'yes'.
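
For reference, this is the relevant fragment of the disk section (resource
name is just an example, and since I set nothing explicitly these are the
values that apply):

resource r0 {
    disk {
        disk-flushes yes;   # default
        md-flushes   yes;   # default
    }
}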

Regards,
Janusz.

Wed, 11 Aug 2021 at 01:05 Digimer <lists at alteeve.ca> wrote:

> On 2021-08-10 3:16 p.m., Janusz Jaskiewicz wrote:
>
> Hi,
>
> Thanks for your answers.
>
> Answering your questions:
> DRBD_KERNEL_VERSION=9.0.25
>
> Linux kernel:
> 4.18.0-305.3.1.el8.x86_64
>
> File system type:
> XFS.
>
> So the file system is not cluster-aware, but as far as I understand, in an
> active/passive, single-primary setup (which is what I have) it should be OK.
> I just checked the documentation, which seems to confirm that.
>
> I think the problem may come from the way I'm testing it.
> I came up with the testing scenario that I described in my first post
> because I didn't have an easy way to abruptly restart the server.
> When I do a hard reset of the primary server it works as expected (or at
> least I can find a logical explanation for what I see).
>
> I think what happened in my previous scenario was:
> The service is writing to the disk, and some portion of the written data is
> still in the page cache. As the picture
> https://linbit.com/wp-content/uploads/drbd/drbd-guide-9_0-en/images/drbd-in-kernel.png
> shows, that cache sits above the DRBD module.
> Then I kill the service and the network, but some data is still in the
> cache.
> At some point the cache is flushed and the data gets written to the disk.
> DRBD probably reports an error at this point, as it can't send that data
> to the secondary node (DRBD thinks the other node has left the cluster).
>
> When I check the files at this point I see more data on the primary,
> because it also contains the data from the cache, which was not replicated
> because the network was down when the data hit DRBD.
>
> When I do a hard restart of the server, the data in the cache is lost, so
> we don't observe the result described above.
>
> Does it make sense?
>
> Regards,
> Janusz.
>
> OK, it sounded from your first post like you had the FS mounted on both
> nodes at the same time; that would be a problem. If it's only mounted in
> one place at a time, then it's OK.
>
> As for caching: in protocol C, DRBD on the Secondary will say "write
> complete" to the Primary when it has been told that the disk write is
> complete. So if the cache is _above_ DRBD's kernel module, then that's
> probably not the problem, because the Secondary won't tell the Primary it's
> done until it receives the data. If there is a caching issue _below_ DRBD
> on the Secondary, then it's _possible_ that's the problem, but I doubt it.
> The reason is that whatever is managing the cache below DRBD on the
> Secondary should know that a given block hasn't been flushed yet and, on a
> read request, read it from the cache rather than the disk. This is a guess
> on my part.
>
> What are your 'disk { disk-flushes [yes|no]; md-flushes [yes|no]; }'
> options set to?
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
>
>