[DRBD-user] DRBD versus bcache and caching in general.

Emmanuel Florac eflorac at intellique.com
Wed Sep 6 14:23:05 CEST 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Wed, 6 Sep 2017 10:37:42 +0900,
Christian Balzer <chibi at gol.com> wrote:

> And once again, the deafening silence shall be broken by replying to
> myself.

Don't be bitter :)
 
> The below is all on Debian Stretch with a 4.11 kernel.
> 
> I tested bcache w/o DRBD initially and the performance as well as
> default behavior (not caching large IOs) was quite good and a perfect
> fit for my use case (mailbox servers). 
> 
> This was true in combination with DRBD as well.
> 
> However it turns out that bcache will not work out of the box with
> DRBD thanks to the slightly inane requirement by its udev helper to
> identify things with lsblk. 
> Which identifies the backing device as DRBD after a reboot and thus
> doesn't auto-assemble the bcache device.
> Hacking that udev rule or simply registering the backing device in
> rc.local will do the trick, but it felt crude.

On the contrary, this looks like perfectly sane Unix-fu to me. Worse is
better, remember? Seriously, udev rules are made easy to write
precisely for all these possible corner cases.
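
For anyone hitting the same thing, the rc.local route boils down to poking
the sysfs register file by hand once DRBD is up. A minimal sketch only; the
device name is an example, use whichever device the udev rule failed to
pick up on your box:

    # load bcache and manually register the device carrying the bcache superblock
    modprobe bcache
    echo /dev/drbd0 > /sys/fs/bcache/register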

> So I tried dm-cache, which doesn't have that particular issue.
> But then again, the complexity of it (lvm in general), the vast
> changes between versions and documentation gotchas, the fact that a
> package required to assemble things at boot time wasn't
> "required" (thin-provisioning-tools) also made this a rather painful
> and involved experience compared to bcache.

That, and its performance sucks. In fact I couldn't get any significant
gain using dm-cache, while bcache is very easily tuned to provide high
gains. Hail bcache.
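
For comparison, an lvmcache stack is set up roughly like this (illustrative
device/VG names only, not a recommendation given the above):

    # slow HDD RAID volume plus an SSD in one VG (names are examples)
    pvcreate /dev/sdb /dev/sdc
    vgcreate vg0 /dev/sdb /dev/sdc
    lvcreate -n data -L 1T vg0 /dev/sdb                        # origin LV on the HDDs
    lvcreate --type cache-pool -n cpool -L 100G vg0 /dev/sdc   # cache pool on the SSD
    lvconvert --type cache --cachepool vg0/cpool vg0/data
    # note: activating this at boot needs cache_check from thin-provisioning-tools,
    # the very package Christian found wasn't pulled in as "required"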

> Its performance is significantly lower and very spiky, with fio stdev
> an order of magnitude higher than bcache.

My opinion precisely.

> For example I could run 2 fio processes doing 4k randwrites capped to
> 5k IOPS each (so 10k total) on top of the bcache DRBD indefinitely
> with the backing device never getting busier than 10% when flushing
> commenced. This test on the same HW with dm-cache yielded 8K IOPS
> max, with high fluctuations and both the cache and backing devices
> getting pegged at 100% busy at times.
> 
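
For anyone wanting to reproduce that kind of capped load, an fio invocation
along these lines should do (destructive on the target device; the device
name and numbers are examples matching the description above):

    # two jobs, 4k random writes, capped at 5k IOPS each -> ~10k IOPS total
    fio --name=cap5k --filename=/dev/drbd0 --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=2 --rate_iops=5000 \
        --time_based --runtime=600 --group_reporting
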
> > What finally broke the camel's back was that with dm-cache, formatting the
> drbd device with ext4 hung things to the point of requiring a forced
> reboot. This was caused by mkfs.ext4 trying to discard blocks (same
> for bcache), which is odd, but then again should just work (it does
> for bcache). Formatting with nodiscard works and the dm-cache drbd
> device then doesn't support fstrim when mounted, unlike bcache.
>  
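
For the record, skipping the discard pass at mkfs time is just an extended
option, and trimming later works where the stack supports it (device and
mount point are examples):

    # format without the discard pass that hung the dm-cache stack
    mkfs.ext4 -E nodiscard /dev/drbd0
    # later, on a stack that passes discards through (bcache here), trim manually
    fstrim -v /srv/mail
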
> So I've settled for bcache at this time, the smoother performance is
> worth the rc.local hack in my book.

Amen to that.

As I missed your previous post, I'll reply below, just in case my opinion
matters :)

> Christian
> 
> On Wed, 16 Aug 2017 12:37:21 +0900 Christian Balzer wrote:
> 
> > Hello,
> > 
> > firstly let me state that I of course read the old thread from 2014
> > and all the other bits I could find.
> > 
> > If anybody in the last 3 years actually deployed bcache or any of
> > the other SSD caching approaches with DRBD, I'd love to hear about
> > it.
> > 
> > I'm looking to use bcache with DRBD in the near future and was
> > pondering the following scenarios, not all bcache specific.
> > 
> > The failure case I'm most interested in is a node going down due to
> > HW or kernel issues, as that's the only case I encountered in 10
> > years. ^.^
> > 
> > 
> > 1. DRBD -> RAID HW cache -> HDD
> > 
> > This is what I've been using for a long time (in some cases w/o RAID
> > controller and thus HW cache). 
> > If node A spontaneously reboots due to a HW failure or kernel crash,
> > things will fail over to node B, which is in best possible and up to
> > date state at this point.
> > Data in the HW cache (and the HDD local cache) is potentially lost.
> > From the DRBD perspective block X has been successfully written to
> > node A and B, even though it just reached the HW cache of the RAID
> > controller. So in the worst case scenario (HW cache
> > lost/invalidated, HDD caches also lost), we've just lost up to
> > 4-5GB worth of in-flight data. And unless something changed those
> > blocks on node B before node A comes back up, they will not be
> > replicated back.
> > 
> > Is the above a correct, possible scenario?

As I understand this scenario, you assume that both your DRBD nodes
have HW RAID controllers, that node A fails, prompting a failover to
node B, and then node B fails immediately? You shouldn't lose anything,
provided that your HW RAID controller has a BBU and therefore CAN'T
lose its cache.

Really, use a BBU. Or did I miss something?
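
If the drive-level caches behind the controller worry you too, they can be
checked and switched off so that only the BBU-protected controller cache
buffers writes. A sketch with hdparm (example device; drives hidden behind a
RAID controller need the vendor's tool instead):

    hdparm -W /dev/sda      # report whether the drive's volatile write cache is on
    hdparm -W 0 /dev/sda    # turn it off, leaving only the BBU-backed cache in the write path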

> > 
> > As far as read caches are concerned, I'm pretty sure the HW caches
> > get invalidated in regards to reads when a crash/reboot happens.
> > 
> > 
> > 2. Bcache -> DRBD -> HW cache -> HDD
> > 
> > With bcache in writeback mode things become interesting in the
> > Chinese sense. 
> > If node A crashes, not only do we lose all the dirty kernel
> > buffers (as always), but everything that was in-flight within
> > bcache before being flushed to DRBD. 
> > While the bcache documentation states that "Barriers/cache flushes
> > are handled correctly." and thus hopefully at least the FS would be
> > in a consistent state, the part that one needs to detach the bcache
> > device or switch to writethrough mode before the backing device is
> > clean and consistent confirms the potential for data loss.
> > 
> > I could live with bcache in write-through mode and leaving the write
> > caching to the HW cache, if losing and re-attaching a backing device
> > (DRBD) invalidates bcache and prevents it from delivering stale
> > data. Alas the bcache documentation is pretty quiet here, from the
> > looks of it only detaching and re-attaching would achieve this.
> > 

Obviously, a bizarre choice.
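
For completeness, the knobs that quoted paragraph refers to live in sysfs
(paths assume the composite device shows up as bcache0):

    cat /sys/block/bcache0/bcache/state        # "no cache", "clean", "dirty", ...
    cat /sys/block/bcache0/bcache/dirty_data   # data still only in the cache
    echo writethrough > /sys/block/bcache0/bcache/cache_mode   # stop creating new dirty data
    echo 1 > /sys/block/bcache0/bcache/detach  # flushes dirty data, then detaches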

> > 
> > 3. DRBD -> bcache -> HW cache -> HDD
> > 
> > The sane and simple approach, as writes will get replicated, no
> > additional dangers in the write path when compared to 1) above. 
> > 
> > If node A goes down and node B takes over, only previous (recent)
> > writes will be in the bcache on node B, the cache will be "cold"
> > otherwise. Once node A comes back the re-sync should hopefully take
> > care of all stale cache information on the A bcache.
> > 


The same rule applies here: don't hang your bcache devices off the
motherboard's SATA/SAS ports; put them behind the HW RAID controller instead,
so that the SSD cache is protected by the BBU. Problem solved.

As bcache on each node caches the same replicated writes, we can safely
assume that the bcache states will be similar on both nodes at the time
of a crash...
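
A minimal sketch of building that stack on each node (device and resource
names are made up; the SSD ideally sits behind the BBU-backed controller as
argued above):

    # format backing device and cache SSD together so they attach automatically
    make-bcache -B /dev/sdb -C /dev/sdc    # composite device appears as /dev/bcache0
    # point the DRBD resource's "disk" option at /dev/bcache0, then bring it up
    drbdadm create-md r0
    drbdadm up r0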

regards,
-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac at intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------