Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

firstly let me state that I have of course read the old thread from 2014 and
all the other bits I could find. If anybody in the last 3 years has actually
deployed bcache or any of the other SSD caching approaches with DRBD, I'd
love to hear about it.

I'm looking to use bcache with DRBD in the near future and was pondering the
following scenarios, not all of them bcache specific. The failure case I'm
most interested in is a node going down due to HW or kernel issues, as that's
the only case I've encountered in 10 years. ^.^

1. DRBD -> RAID HW cache -> HDD

This is what I've been using for a long time (in some cases w/o a RAID
controller and thus w/o HW cache). If node A spontaneously reboots due to a
HW failure or kernel crash, things will fail over to node B, which is in the
best possible and most up-to-date state at this point. Data in the HW cache
(and the local HDD caches) is potentially lost. From the DRBD perspective
block X has been successfully written to nodes A and B, even though it may
only have reached the HW cache of the RAID controller (see the drbd.conf
sketch in the P.S. below for the kind of setup I mean). So in the worst case
(HW cache lost/invalidated, HDD caches also lost), we've just lost up to
4-5GB worth of in-flight data. And unless something changed those blocks on
node B before node A comes back up, they will not be replicated back.

Is the above a correct, possible scenario?

As far as read caches are concerned, I'm pretty sure the HW caches get
invalidated with regard to reads when a crash/reboot happens.

2. bcache -> DRBD -> HW cache -> HDD

With bcache in writeback mode things become interesting in the Chinese
sense. If node A crashes, not only do we lose all the dirty kernel buffers
(as always), but also everything that was in flight within bcache before
being flushed down to DRBD. While the bcache documentation states that
"Barriers/cache flushes are handled correctly" and thus hopefully at least
the FS would be in a consistent state, the fact that one needs to detach the
bcache device or switch it to writethrough mode before the backing device is
clean and consistent confirms the potential for data loss.

I could live with bcache in writethrough mode and leaving write caching to
the HW cache, provided that losing and re-attaching a backing device (DRBD)
invalidates bcache and prevents it from delivering stale data. Alas, the
bcache documentation is pretty quiet here; from the looks of it, only
detaching and re-attaching would achieve this (rough command sketch in the
P.S. below).

3. DRBD -> bcache -> HW cache -> HDD

The sane and simple approach, as writes will get replicated and there are no
additional dangers in the write path compared to 1) above. If node A goes
down and node B takes over, only previous (recent) writes will be in the
bcache on node B; the cache will be "cold" otherwise. Once node A comes back,
the re-sync should hopefully take care of all stale cache information in the
bcache on node A (see the P.S. for a stacking sketch).

Obviously having bcache as an associated resource, as per Florian's old
video, would be the "safest" approach, but AFAICT there is no resource agent
for this and it would also add the replication write latency (twice?).

Regards,

Christian

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Rakuten Communications
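
P.S. To make the scenario 1 write path concrete, here is a minimal drbd.conf
sketch of the kind of setup I mean; the disk-flushes/md-flushes options are
from the DRBD 8.4 man page as I read it, and all resource names, devices and
addresses are placeholders. With a BBU-backed HW cache flushes are commonly
disabled, which is exactly why DRBD considers block X written once the
controller has accepted it:

    # Hypothetical DRBD 8.4 resource; names and addresses are placeholders.
    resource r0 {
        device    /dev/drbd0;
        disk      /dev/sda1;       # LUN behind the RAID controller's HW cache
        meta-disk internal;

        disk {
            disk-flushes no;       # typical with a BBU-backed HW cache:
            md-flushes   no;       # writes are acked once the controller has
                                   # them, which is the exposure described above
        }

        on nodeA { address 192.0.2.1:7789; }
        on nodeB { address 192.0.2.2:7789; }
    }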
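
For scenario 2 (bcache on top of DRBD), these are the knobs I'd be relying
on, as I understand Documentation/bcache.txt; /dev/bcache0 and the cache set
UUID are placeholders, and the detach/attach semantics are my reading of the
docs, not something I have tested:

    # Run bcache in writethrough mode, so no dirty data lives only on the SSD:
    echo writethrough > /sys/block/bcache0/bcache/cache_mode

    # Before trusting the backing device (DRBD) on its own, check it is clean:
    cat /sys/block/bcache0/bcache/state        # want "clean", not "dirty"
    cat /sys/block/bcache0/bcache/dirty_data

    # The only way I can see to invalidate stale cache contents after a
    # failover: detach from the cache set and attach again.
    echo 1 > /sys/block/bcache0/bcache/detach
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach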
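
And a sketch of how I'd stack scenario 3, the variant I'm actually leaning
towards; device names and the resource name r0 are again made up:

    # SSD as cache device, RAID LUN as backing device:
    make-bcache -C /dev/sdb
    make-bcache -B /dev/sda1
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

    # DRBD then sits on top of the bcache device, so every write passes
    # through DRBD (and gets replicated) before it reaches the local bcache:
    #   disk /dev/bcache0;   (in the resource definition)
    drbdadm create-md r0
    drbdadm up r0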