<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 2021-08-10 3:16 p.m., Janusz
Jaskiewicz wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAGf4UHBwuE_w21RxbR8aq-hC_4H01bD_1nO4UvtbdbNJ+Eb0ig@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">Hi,
<div><br>
</div>
<div>Thanks for your answers.</div>
<div><br>
</div>
<div>Answering your questions:</div>
<div>DRBD_KERNEL_VERSION=9.0.25<br>
</div>
<div><br>
</div>
<div>Linux kernel:</div>
<div>4.18.0-305.3.1.el8.x86_6<br>
</div>
<div><br>
</div>
<div>File system type: </div>
<div>XFS.</div>
<div><br>
</div>
<div>So the file system is not cluster-aware, but as far as I
understand in an active/passive setup - single primary (that I
have) it should be OK.</div>
<div>Just checked the doc which seems to confirm that.</div>
<div><br>
</div>
<div>I think the problem may come from the way I'm testing it.</div>
<div>I came up with this testing scenario, that I described in
my first post, because I didn't have an easy way to abruptly
restart the server.</div>
<div>When I do the hard reset of the primary server it works as
expected (at least I can find a logical explanation).</div>
<div><br>
</div>
<div>I think what happened in my previous scenario was:</div>
<div>Service is writing to the disk, and some portion of the
written data is in a disk cache. As the picture <a
href="https://linbit.com/wp-content/uploads/drbd/drbd-guide-9_0-en/images/drbd-in-kernel.png"
moz-do-not-send="true">https://linbit.com/wp-content/uploads/drbd/drbd-guide-9_0-en/images/drbd-in-kernel.png</a>
shows, the cache is above the DRBD module.</div>
<div>Then I kill the service and the network, but some data is
still in the cache.</div>
<div>At some point the cache is flushed and the data gets
written to the disk.</div>
<div>DRBD probably reports some error at this point, as it can't
send that data to the secondary node (DRBD thinks the other
node has left the cluster).</div>
<div><br>
</div>
<div>When I check the files at this point I see more data on the
primary because it also contains the data from the cache,
which were not replicated because the network was down when
the data hit the DRBD.</div>
<div><br>
</div>
<div>When I do the hard restart of the server, data in the cache
is lost, so we don't observe the result as above.</div>
<div><br>
</div>
<div>Does it make sense?</div>
<div><br>
</div>
<div>Regards,</div>
<div>Janusz.</div>
</div>
</blockquote>
    <p>OK. From your first post, it sounded like you had the FS mounted
      on both nodes at the same time, which would be a problem. If it's
      only mounted in one place at a time, then it's OK.</p>
    <p>As for caching: in protocol C, DRBD on the Secondary only says
      "write complete" to the Primary once it has itself been told that
      the disk write is complete. So if the cache is _above_ DRBD's
      kernel module, then that's probably not the problem, because the
      Secondary won't tell the Primary it's done until it receives the
      data. If there is a caching issue _below_ DRBD on the Secondary,
      then it's _possible_ that's the problem, but I doubt it. The
      reason is that whatever is managing the cache below DRBD on the
      Secondary should know that a given block hasn't been flushed yet
      and, on a read request, serve it from cache rather than from
      disk. This is a guess on my part.</p>
    <p>What are your 'disk { disk-flushes [yes|no]; md-flushes
      [yes|no]; }' options set to?</p>
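    <p>For reference, here is a rough sketch of where those two options
      would sit in a resource file. The resource name "r0" is just a
      placeholder I made up, and the values shown are only an
      illustration, not a recommendation for your setup. If neither
      option appears in your config, DRBD should be using its default,
      which as far as I know is "yes" for both. Something like
      'drbdsetup show --show-defaults r0' should print the values
      actually in effect.</p>
    <pre>resource r0 {        # "r0" is a hypothetical resource name
    disk {
        disk-flushes yes;   # flush the backing data device on dependent writes
        md-flushes   yes;   # flush the DRBD metadata device as well
    }
    # ... the rest of the resource definition is unchanged ...
}</pre>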
<pre class="moz-signature" cols="72">--
Digimer
Papers and Projects: <a class="moz-txt-link-freetext" href="https://alteeve.com/w/">https://alteeve.com/w/</a>
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould</pre>
</body>
</html>