[Drbd-dev] Behaviour of verify: false positives -> true positives

Tue Sep 9 16:02:30 CEST 2008

Hi,

my company is considering drbd for building up failover clusters in shared 
hosting.

During our preliminary tests, we noticed that a "drbdadm verify /dev/drbdx" 
detects differences on a heavily loaded test server (several thousand 
customers).

We noticed two kinds of verify differences: one is surely temporary (not 
repeatable), but the other is persistent, even after unmounting the 
filesystem.

According to the manpage on drbd.conf (section "notes on data integrity"), 
these should be "false positives". Indeed, we found no real corruption (all 
differing blocks were associated with deleted files).

However, this means that verify is (from _our_ point of view) no _reliable_ 
check for data integrity. Since the integrity of our valuable customer data 
is of great concern to us, we are looking for ways to change the behavior 
such that no false positives are reported any more, i.e. any difference 
reported by verify should be _guaranteed_ to be a "true positive". In my 
humble opinion, so-called "mission critical" applications demand that in 
general.

In my understanding of kernel architecture, I believe the block differences 
are caused by an _intended_ race in the kernel at buffer cache level. 
Whenever a block gets dirty, there is (deliberately) no lock for any consumer 
of the buffer cache (such as ext3) which would prevent it from 
(re-)modifying the block while it is being written to an ordinary disk (not 
necessarily even a drbd device). IMHO, this deliberate race is _crucial_ for 
kernel performance (and thus I don't want to dispute it).

Normally, this race should be no problem at all, even if an inconsistent block 
(half of the 512-byte block old, other half new version) is written to disk: 
the dirty bit is simply set again at the buffer cache level, leading to 
another writeout which eventually fixes the problem.
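
To make the model above concrete, here is a minimal toy simulation in Python. 
It is only a sketch of the reasoning, not kernel code: the Buffer class, the 
two-halves copy and all names are assumptions made up for illustration.

```python
# Toy model of the buffer-cache race: a block is modified while its
# writeout is in progress. All names here are hypothetical.

BLOCK_SIZE = 512


class Buffer:
    def __init__(self):
        self.data = bytearray(BLOCK_SIZE)
        self.dirty = False

    def modify(self, fill):
        # No lock is taken: a modification may happen during a writeout.
        self.data[:] = bytes([fill]) * BLOCK_SIZE
        self.dirty = True


def writeout(buf, disk, racing_fill=None):
    """Copy buf to disk in two halves; optionally modify buf in between."""
    half = BLOCK_SIZE // 2
    buf.dirty = False                    # cleared when the IO is submitted
    disk[:half] = buf.data[:half]
    if racing_fill is not None:
        buf.modify(racing_fill)          # the race: sets the dirty bit again
    disk[half:] = buf.data[half:]


def demo():
    buf, disk = Buffer(), bytearray(BLOCK_SIZE)
    buf.modify(1)
    writeout(buf, disk, racing_fill=2)   # torn block: half old, half new
    torn = bytes(disk) != bytes(buf.data)
    still_dirty = buf.dirty
    writeout(buf, disk)                  # second writeout repairs the block
    repaired = bytes(disk) == bytes(buf.data)
    return torn, still_dirty, repaired
```

Calling demo() shows the three stages: the first writeout leaves a torn block 
on "disk", the buffer is dirty again, and the second writeout repairs it.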

I believe (but am not 100% sure; please comment) that this model can explain 
the temporary differences seen by a drbd verify: when the data _content_ is 
mirrored across the network, the local disk version might have a 
different "timestamp" (in _real_ time) than the version transmitted over 
ethernet. Thus, _any_ IO of dirty blocks might be inconsistent due to the 
deliberate kernel race on data block _content_ (dereferencing of 
buffer_head->b_data in parallel to disk IO). With ordinary load patterns, 
the "chance" to see _temporary_ false positives caused by that race is 
probably extremely low (perhaps one in several billion). But on a heavily 
loaded system we have observed it from time to time. Attached below are some 
tiny perl scripts which can reproduce temporary false positives with a fairly 
good chance, at least on our test systems (I'd be glad to receive 
reproductions and experience reports from other users too). Just start 
io_rewrite_sector2.pl on a (preferably empty) ext3 filesystem on top of drbd, 
_in parallel_ to a "drbdadm verify /dev/corresponding_device" (other 
filesystems not yet tested). My filesystem for this test was about 2GB in 
size. You might have to adjust some constants to reproduce the effect.
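
The "different timestamps" idea can be sketched as follows. This is a toy 
model under stated assumptions, not drbd code: mirrored_write() and verify() 
are hypothetical names, and SHA-1 merely stands in for whatever digest a 
verify run is configured to use.

```python
import hashlib


def mirrored_write(buf, modify_between=None):
    """Return (local_copy, remote_copy) of buf as seen by disk and network."""
    local = bytes(buf)             # the local disk IO reads the buffer now
    if modify_between is not None:
        modify_between(buf)        # the filesystem rewrites the block meanwhile
    remote = bytes(buf)            # the network send reads the buffer later
    return local, remote


def verify(local, remote):
    """Compare block digests, as an online verify conceptually does."""
    return hashlib.sha1(local).digest() == hashlib.sha1(remote).digest()


def demo():
    buf = bytearray(b"\x01" * 512)

    def bump(b):
        # Hypothetical counter update in the first four bytes of the block.
        b[0:4] = (2).to_bytes(4, "little")

    local, remote = mirrored_write(buf, modify_between=bump)
    mismatch = not verify(local, remote)
    # The block is dirty again, so it is written once more, this time
    # without a concurrent modification.
    local, remote = mirrored_write(buf)
    return mismatch, verify(local, remote)
```

In this model the first round produces a verify mismatch, and the follow-up 
write of the still-dirty block makes both copies agree again, which is why 
such a difference would only be temporary.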

So far, I have been reasoning about the _temporary_ false positives. 
Persistent false positives could be explained by the following theory:

When a file is eventually deleted or truncated, bforget() is called at the 
buffer cache interface. After that, dirty blocks are no longer transferred to 
disk, in order to save IO load (IMHO this is _crucial_ for typical access 
patterns on /tmp/ where typical lifetimes are often less than 1 second). As a 
consequence, the above-mentioned "fixing" of inconsistent blocks is no longer 
carried out and long-term differences can remain on the mirrored device, but 
belonging to deleted files only. Again, the chance to observe that is very 
low, but I have written another tiny perl script to reproduce that. Just 
start test_orphan.pl on an _empty_ drbd-mounted filesystem, and _afterwards_ 
check it with verify. Since the filesystem is empty again after the test, you 
can be sure that the differences belong to empty or orphan files (if they 
belonged to filesystem metadata, you would notice that by inspecting the 
contents with dd). By investigating the different counter values with dd on 
the underlying devices (primary vs secondary), you can even tell which 
version was written first. Interestingly, I found that most of the time the 
local version was the older one, but sometimes it was vice versa.
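
The theory for persistent differences can be illustrated with another small 
toy model. Again this is only a sketch: Block, flush() and the Python 
bforget() are hypothetical stand-ins for the buffer cache behaviour described 
above, and the "counter=N" strings are made-up block contents.

```python
class Block:
    def __init__(self, content):
        self.content = content
        self.dirty = True


def flush(block, local, remote, racing_update=None):
    """Write a dirty block to both replicas; skip clean or forgotten ones."""
    if not block.dirty:
        return False
    block.dirty = False
    local.append(block.content)
    if racing_update is not None:
        block.content = racing_update    # modified between the two sub-IOs
        block.dirty = True
    remote.append(block.content)
    return True


def bforget(block):
    # Drop the pending write: the file was deleted, its data is not needed.
    block.dirty = False


def demo():
    local, remote = [], []
    blk = Block("counter=1")
    flush(blk, local, remote, racing_update="counter=2")
    bforget(blk)                         # file unlinked before the next flush
    rewritten = flush(blk, local, remote)
    return local[-1], remote[-1], rewritten
```

Here the repairing second writeout never happens, so the two replicas keep 
different contents for a block that no longer belongs to any live file.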

As a side note: test_orphan.pl also produces orphan files. Sometimes the 
counter values observed in different blocks are lower than the point of 
unlink() [currently set to 500], but most of the time the counter value is 
greater. It seems to me that applications producing orphan files raise the 
chance to observe false positives, but I am not sure. I do not yet have a 
theory for that; please comment. It might be influenced by the timing of the 
block IO daemon and/or by the load patterns.
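
For comparing the two sides, a small helper like the following may be handy as 
an alternative to eyeballing dd output. It is a sketch which assumes that both 
backing devices were first dumped to image files (for example with dd) and 
copied to one machine; the function name, the 4096-byte block size and the 
16-byte preview are arbitrary choices of mine.

```python
def diff_blocks(path_a, path_b, block_size=4096, preview=16):
    """Return (block_index, head_a, head_b) for every differing block."""
    differing = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break                    # both images are exhausted
            if a != b:
                # Keep the first bytes of each side so that counter values
                # stored at the start of a block can be compared by eye.
                differing.append((index, a[:preview], b[:preview]))
            index += 1
    return differing
```

Multiplying a reported block index by the block size gives the byte offset to 
inspect more closely on both nodes.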

-------

Now I am thinking about different solutions. Please comment.

Here are just some brainstorming ideas, without judging their quality (that 
will come later):

1) Try to avoid the kernel race on buffer content, specifically for drbd 
devices as an _option_ (which is _off_ by default). There are at least two 
sub-variants of that:
1a) use locking
1b) use an idea published by Herlihy for conflict-free resolution of different 
_versions_ of blocks, either on the fly or optionally residing in the buffer 
cache _in parallel_ [nb probably the latter could result in a major rewrite 
of large portions of the kernel, not to be disputed here on this list]

2) Whenever drbd-verify sees a difference, retry the comparison a few times 
after a short delay (possibly with exponential backoff), until _temporary_ 
differences have been filtered out. Persistent differences will not be 
tackled by that.

3) As an addition to 2), add an _option_ (which is _off_ by default) to the 
buffer cache code to submit blocks dropped via bforget() specifically to drbd 
devices.

4) Add an option to drbd (as usual _off_ by default), which calculates a 
checksum on _every_ arriving IO request _first_ (before starting any 
sub-request). After finishing both the local and remote sub-IO, calculate the 
checksum again and compare. If a difference is found, restart both 
sub-transmissions again, until no mismatches are found any more.

5) As a refinement of 4), first filter out the temporary false positives by 
means of 2). Additionally try to identify blocks dropped via bforget() at the 
buffer cache level and submit them only _once_ after the bforget(), and only 
to drbd devices where the corresponding option is set. Then drbd uses method 
4) _only_ on those blocks, thereby limiting the performance impact of 4) to a 
rare special case.

6) Try to establish a complete solution in the presence of races solely at the 
buffer cache level, without affecting drbd in any way. I am not sure whether 
this is possible. The rough idea is to _identify_ all races _reliably_(!) 
when they _actually_ occur (as opposed to _possible_ occurrence). 
Theoretically b_count, b_state and/or other/similar means should be available 
to detect actually occurring races during IO, but I am extremely unsure 
whether this is possible _reliably_ without additional means such as 
checksumming. Probably this is the wrong list for discussing this topic, but 
I would be thankful for hints and ideas before going elsewhere with an 
incomplete idea.
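
To illustrate idea 2), here is a minimal sketch of a retrying comparison with 
exponential backoff. It is not drbd code: verify_block(), the two reader 
callbacks and the default retry parameters are assumptions chosen only for 
the example.

```python
import time


def verify_block(read_local, read_remote, retries=4, base_delay=0.01,
                 sleep=time.sleep):
    """Return True if both copies match on some attempt.

    Returns False only if the difference persists across all attempts.
    """
    delay = base_delay
    for attempt in range(retries + 1):
        if read_local() == read_remote():
            return True
        if attempt < retries:
            sleep(delay)     # give a pending writeout time to complete
            delay *= 2       # exponential backoff
    return False
```

A temporary difference disappears after a retry or two and is not reported, 
while a persistent one still fails after the last attempt, which matches the 
limitation stated in 2).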

Now what I personally think of them: 1a) has too strong a performance impact, 
1b) would probably complicate the kernel by orders of magnitude, 2) is easy 
but does not solve all problems, 3) is feasible for non-general applications 
(outside of /tmp/ etc) but could probably lead to fundamental discussions 
with kernel developers (forcing us into an in-house patch), 4) solves it all 
very easily but hurts performance, 5) is very complicated, and 6) is no 
mature idea yet.

Currently, I would prefer 4) as an option for testing and for gaining 
experience at the first try, but depending on performance results other 
mechanisms should be evaluated.
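
A rough sketch of how I imagine 4) to work, written as a toy model in Python 
rather than as drbd code. The function name, the two writer callbacks, the 
use of CRC32 and the max_rounds limit are all assumptions for illustration.

```python
import zlib


def replicated_write(buf, write_local, write_remote, max_rounds=8):
    """Write buf to both replicas; repeat while buf changed during the IO.

    Returns the number of rounds that were needed.
    """
    for rounds in range(1, max_rounds + 1):
        before = zlib.crc32(buf)         # checksum before any sub-request
        write_local(bytes(buf))
        write_remote(bytes(buf))
        after = zlib.crc32(buf)          # checksum after both sub-IOs
        if before == after:
            return rounds
    raise RuntimeError("buffer kept changing during IO")
```

Note that comparing only the checksums before and after would miss a buffer 
that is changed and then changed back while the IO is in flight, so this is an 
illustration of the idea, not an argument that it is complete.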

Of course, other people have other ideas and opinions, so please feel free to 
comment.

Thanks for your patience,

Thomas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: io_rewrite_sector2.pl
Type: application/x-perl
Size: 287 bytes
Desc: not available
Url : http://lists.linbit.com/pipermail/drbd-dev/attachments/20080909/501e9c27/io_rewrite_sector2.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_orphan.pl
Type: application/x-perl
Size: 492 bytes
Desc: not available
Url : http://lists.linbit.com/pipermail/drbd-dev/attachments/20080909/501e9c27/test_orphan.bin