Hello,

On Wed, 6 Sep 2017 14:23:05 +0200 Emmanuel Florac wrote:

> Le Wed, 6 Sep 2017 10:37:42 +0900
> Christian Balzer <chibi at gol.com> écrivait:
>
> > And once again, the deafening silence shall be broken by replying to
> > myself.
>
> Don't be bitter :)
>
It's therapeutic. ^o^

> > The below is all on Debian Stretch with a 4.11 kernel.
> >
> > I tested bcache w/o DRBD initially and the performance as well as
> > default behavior (not caching large IOs) was quite good and a
> > perfect fit for my use case (mailbox servers).
> >
> > This was true in combination with DRBD as well.
> >
> > However it turns out that bcache will not work out of the box with
> > DRBD thanks to the slightly inane requirement by its udev helper to
> > identify things with lsblk.
> > Which identifies the backing device as DRBD after a reboot and thus
> > doesn't auto-assemble the bcache device.
> > Hacking that udev rule or simply registering the backing device in
> > rc.local will do the trick, but it felt crude.
>
> On the contrary, this looks like perfectly sane Unix-fu to me. Worse
> is better, remember? Seriously, udev rules are made easy to write
> precisely for all these possible border cases.
>
I wasn't particularly intimidated by the prospect of changing that udev
rule, but the possibility of having it overwritten by an upgrade left
me with the rc.local hack.
Incidentally, if I weren't using pacemaker (and thus DRBD being started
very late in the game), the sequencing of udev rules might prevent
things from working anyway:
65-drbd.rules
69-bcache.rules

> > So I tried dm-cache, which doesn't have that particular issue.
> > But then again, the complexity of it (lvm in general), the vast
> > changes between versions and documentation gotchas, the fact that a
> > package required to assemble things at boot time wasn't
> > "required" (thin-provisioning-tools) also made this a rather painful
> > and involved experience compared to bcache.
>
> That, and its performance sucks.
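For reference, the rc.local workaround mentioned above could look
something like the following sketch; /dev/drbd0 is a placeholder for
the actual DRBD backing device, while /sys/fs/bcache/register is the
standard bcache registration interface:

```shell
# /etc/rc.local fragment (sketch): manually register the DRBD device
# with bcache at boot, since the stock udev rule misidentifies it via
# lsblk and skips it. /dev/drbd0 is a placeholder.
if [ -e /sys/fs/bcache/register ]; then
    echo /dev/drbd0 > /sys/fs/bcache/register
fi
```

This only helps if DRBD is already up when rc.local runs, which is why
the late pacemaker-driven DRBD start matters here.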
> In fact I couldn't get any significant
> gain using dm-cache, while bcache is very easily tuned to provide
> high gains. Hail bcache.
>
> > Its performance is significantly lower and very spiky, with fio
> > stdev an order of magnitude higher than bcache.
>
> My opinion precisely.
>
> > For example I could run 2 fio processes doing 4k randwrites capped
> > to 5k IOPS each (so 10k total) on top of the bcache DRBD
> > indefinitely with the backing device never getting busier than 10%
> > when flushing commenced. This test on the same HW with dm-cache
> > yielded 8k IOPS max, with high fluctuations and both the cache and
> > backing devices getting pegged at 100% busy at times.
> >
> > What finally broke the camel's back was that with dm-cache,
> > formatting the DRBD device with ext4 hung things to the point of
> > requiring a forced reboot. This was caused by mkfs.ext4 trying to
> > discard blocks (same for bcache), which is odd, but then again
> > should just work (it does for bcache). Formatting with nodiscard
> > works, but the dm-cache DRBD device then doesn't support fstrim
> > when mounted, unlike bcache.
> >
> > So I've settled for bcache at this time; the smoother performance
> > is worth the rc.local hack in my book.
>
> Amen to that.
>
Another update to that: I managed to crash/reboot the primary DRBD node
when doing an e4defrag against the bcache'd DRBD device.
So don't do that, I guess.

> As I missed your previous post I'll reply below just in case my
> opinion may matter :)
>
See below.

> > Christian
> >
> > On Wed, 16 Aug 2017 12:37:21 +0900 Christian Balzer wrote:
> >
> > > Hello,
> > >
> > > firstly let me state that I of course read the old thread from
> > > 2014 and all the other bits I could find.
> > >
> > > If anybody in the last 3 years actually deployed bcache or any of
> > > the other SSD caching approaches with DRBD, I'd love to hear
> > > about it.
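For illustration, the capped randwrite test described above could be
expressed as a fio job file along these lines; the device path, queue
depth and runtime are assumptions, not the author's exact job:

```ini
; Sketch of the 2 x 5k-IOPS capped 4k randwrite test.
; /dev/bcache0, iodepth and runtime are placeholders.
[global]
ioengine=libaio
direct=1
bs=4k
rw=randwrite
iodepth=32
rate_iops=5000
runtime=600
time_based

[writer]
filename=/dev/bcache0
numjobs=2
group_reporting
```

With numjobs=2 and rate_iops=5000 per job this matches the "5k IOPS
each, 10k total" setup. The nodiscard formatting mentioned above would
presumably be mkfs.ext4's extended option, e.g.
`mkfs.ext4 -E nodiscard <device>`.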
> > >
> > > I'm looking to use bcache with DRBD in the near future and was
> > > pondering the following scenarios, not all bcache specific.
> > >
> > > The failure case I'm most interested in is a node going down due
> > > to HW or kernel issues, as that's the only case I encountered in
> > > 10 years. ^.^
> > >
> > >
> > > 1. DRBD -> RAID HW cache -> HDD
> > >
> > > This is what I've been using for a long time (in some cases w/o
> > > RAID controller and thus HW cache).
> > > If node A spontaneously reboots due to a HW failure or kernel
> > > crash, things will fail over to node B, which is in the best
> > > possible and most up-to-date state at this point.
> > > Data in the HW cache (and the HDD local cache) is potentially
> > > lost.
> > > From the DRBD perspective block X has been successfully written
> > > to nodes A and B, even though it just reached the HW cache of the
> > > RAID controller. So in the worst-case scenario (HW cache
> > > lost/invalidated, HDD caches also lost), we've just lost up to
> > > 4-5GB worth of in-flight data. And unless something changed those
> > > blocks on node B before node A comes back up, they will not be
> > > replicated back.
> > >
> > > Is the above a correct, possible scenario?
>
> As I understand this scenario, you suppose that both your DRBD nodes
> have HW RAID controllers, that node A fails, prompting a fail-over to
> node B, then node B fails immediately? You shouldn't lose anything,
> provided that your HW RAID controller has a BBU and therefore CAN'T
> lose its cache.
>
> Really, use a BBU. Or did I miss something?
>
As I wrote, I've seen cases where the controller lost its marbles in
regards to the cache (or anything else), completely independent of BBU
or power loss.
But that's a special, rare case, obviously.

> > >
> > > As far as read caches are concerned, I'm pretty sure the HW
> > > caches get invalidated in regards to reads when a crash/reboot
> > > happens.
> > >
> > >
> > > 2.
Bcache -> DRBD -> HW cache -> HDD
> > >
> > > With bcache in writeback mode things become interesting in the
> > > Chinese sense.
> > > If node A crashes, not only do we lose all the dirty kernel
> > > buffers (as always), but also everything that was in-flight
> > > within bcache before being flushed to DRBD.
> > > While the bcache documentation states that "Barriers/cache
> > > flushes are handled correctly." and thus hopefully at least the
> > > FS would be in a consistent state, the part that one needs to
> > > detach the bcache device or switch to writethrough mode before
> > > the backing device is clean and consistent confirms the potential
> > > for data loss.
> > >
> > > I could live with bcache in writethrough mode and leaving the
> > > write caching to the HW cache, if losing and re-attaching a
> > > backing device (DRBD) invalidates bcache and prevents it from
> > > delivering stale data. Alas, the bcache documentation is pretty
> > > quiet here; from the looks of it only detaching and re-attaching
> > > would achieve this.
> > >
> Obviously, a bizarre choice.
>
> > >
> > > 3. DRBD -> bcache -> HW cache -> HDD
> > >
> > > The sane and simple approach, as writes will get replicated; no
> > > additional dangers in the write path when compared to 1) above.
> > >
> > > If node A goes down and node B takes over, only previous (recent)
> > > writes will be in the bcache on node B, the cache will be "cold"
> > > otherwise. Once node A comes back the re-sync should hopefully
> > > take care of all stale cache information on the A bcache.
> > >
> Same rule applies here: don't connect your bcache backing device to
> your SATA/SAS ports on the motherboard, use the HW RAID controller
> instead, so that your SSD cache is protected by the BBU. Problem
> solved.
>
The SSDs in question are of course DC level ones, with full power loss
protection.
Keeping things separate also helps tremendously with performance.
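For reference, the detach/re-attach and cache-mode operations discussed
above go through bcache's sysfs interface; a sketch, where bcache0 and
the cache-set UUID are placeholders for the actual setup:

```shell
# Sketch of bcache sysfs administration (placeholders throughout).
# Switch to writethrough before a planned failover:
echo writethrough > /sys/block/bcache0/bcache/cache_mode
# Detach the cache set; dirty data is flushed and the backing device
# is left clean and consistent:
echo 1 > /sys/block/bcache0/bcache/detach
# Re-attach later by writing the cache set's UUID:
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
```

Detaching is what invalidates the cache contents for the backing
device, which is why it is the only documented way to avoid serving
stale data after the scenario described above.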
On the HW I'm testing/implementing this on now, moving the IRQ line for
the onboard AHCI from the default (shared with everything else) CPU 0
to another core (on the same/correct NUMA node, of course) improved 4k
randwrite IOPS from 50k to 70k.
I had already moved the Infiniband and RAID (Areca) controller IRQs off
CPU 0.

Christian

> As the bcache caches the same writes from both nodes, we can safely
> suppose that the bcache states will be similar on both nodes at time
> of crash...
>
> regards,

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Rakuten Communications