Hello,

On Wed, 6 Sep 2017 14:23:05 +0200 Emmanuel Florac wrote:

> Le Wed, 6 Sep 2017 10:37:42 +0900
> Christian Balzer <chibi at gol.com> écrivait:
>
> > And once again, the deafening silence shall be broken by replying to
> > myself.
>
> Don't be bitter :)
>
It's therapeutic. ^o^

> > The below is all on Debian Stretch with a 4.11 kernel.
> >
> > I tested bcache w/o DRBD initially and the performance as well as
> > default behavior (not caching large IOs) was quite good and a
> > perfect fit for my use case (mailbox servers).
> >
> > This was true in combination with DRBD as well.
> >
> > However it turns out that bcache will not work out of the box with
> > DRBD thanks to the slightly inane requirement by its udev helper to
> > identify things with lsblk.
> > Which identifies the backing device as DRBD after a reboot and thus
> > doesn't auto-assemble the bcache device.
> > Hacking that udev rule or simply registering the backing device in
> > rc.local will do the trick, but it felt crude.
>
> On the contrary, this looks like perfectly sane Unix-fu to me. Worse
> is better, remember? Seriously, udev rules are made easy to write
> precisely for all these possible border cases.
>
I wasn't particularly intimidated by the prospect of changing that udev
rule, but the possibility of having it overwritten by an upgrade left
me with the rc.local hack.
Incidentally, if I weren't using pacemaker (and thus DRBD being started
very late in the game), the sequencing of udev rules might prevent
things from working anyway:
65-drbd.rules
69-bcache.rules

> > So I tried dm-cache, which doesn't have that particular issue.
> > But then again, the complexity of it (lvm in general), the vast
> > changes between versions and documentation gotchas, the fact that a
> > package required to assemble things at boot time wasn't
> > "required" (thin-provisioning-tools) also made this a rather painful
> > and involved experience compared to bcache.
>
> That, and its performance sucks.
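For reference, the rc.local workaround mentioned above could look
something like the following sketch; /dev/drbd0 is a placeholder for
the actual DRBD backing device, while /sys/fs/bcache/register is the
standard bcache registration interface:

```shell
# /etc/rc.local fragment (sketch): manually register the DRBD device
# with bcache at boot, since the stock udev rule misidentifies it via
# lsblk and skips it. /dev/drbd0 is a placeholder.
if [ -e /sys/fs/bcache/register ]; then
    echo /dev/drbd0 > /sys/fs/bcache/register
fi
```

This only helps if DRBD is already up when rc.local runs, which is why
the late pacemaker-driven DRBD start matters here.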
> In fact I couldn't get any significant
> gain using dm-cache, while bcache is very easily tuned to provide
> high gains. Hail bcache.
>
> > Its performance is significantly lower and very spiky, with fio
> > stdev an order of magnitude higher than bcache.
>
> My opinion precisely.
>
> > For example I could run 2 fio processes doing 4k randwrites capped
> > to 5k IOPS each (so 10k total) on top of the bcache DRBD
> > indefinitely with the backing device never getting busier than 10%
> > when flushing commenced. This test on the same HW with dm-cache
> > yielded 8k IOPS max, with high fluctuations and both the cache and
> > backing devices getting pegged at 100% busy at times.
> >
> > What finally broke the camel's back was that with dm-cache,
> > formatting the DRBD device with ext4 hung things to the point of
> > requiring a forced reboot. This was caused by mkfs.ext4 trying to
> > discard blocks (same for bcache), which is odd, but then again
> > should just work (it does for bcache). Formatting with nodiscard
> > works, but the dm-cache DRBD device then doesn't support fstrim
> > when mounted, unlike bcache.
> >
> > So I've settled for bcache at this time; the smoother performance
> > is worth the rc.local hack in my book.
>
> Amen to that.
>
Another update to that: I managed to crash/reboot the primary DRBD node
when doing an e4defrag against the bcache'd DRBD device.
So don't do that, I guess.

> As I missed your previous post I'll reply below just in case my
> opinion may matter :)
>
See below.

> > Christian
> >
> > On Wed, 16 Aug 2017 12:37:21 +0900 Christian Balzer wrote:
> >
> > > Hello,
> > >
> > > firstly let me state that I of course read the old thread from
> > > 2014 and all the other bits I could find.
> > >
> > > If anybody in the last 3 years actually deployed bcache or any of
> > > the other SSD caching approaches with DRBD, I'd love to hear
> > > about it.
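For illustration, the capped randwrite test described above could be
expressed as a fio job file along these lines; the device path, queue
depth and runtime are assumptions, not the author's exact job:

```ini
; Sketch of the 2 x 5k-IOPS capped 4k randwrite test.
; /dev/bcache0, iodepth and runtime are placeholders.
[global]
ioengine=libaio
direct=1
bs=4k
rw=randwrite
iodepth=32
rate_iops=5000
runtime=600
time_based

[writer]
filename=/dev/bcache0
numjobs=2
group_reporting
```

With numjobs=2 and rate_iops=5000 per job this matches the "5k IOPS
each, 10k total" setup. The nodiscard formatting mentioned above would
presumably be mkfs.ext4's extended option, e.g.
`mkfs.ext4 -E nodiscard <device>`.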
> > >
> > > I'm looking to use bcache with DRBD in the near future and was
> > > pondering the following scenarios, not all bcache specific.
> > >
> > > The failure case I'm most interested in is a node going down due
> > > to HW or kernel issues, as that's the only case I encountered in
> > > 10 years. ^.^
> > >
> > >
> > > 1. DRBD -> RAID HW cache -> HDD
> > >
> > > This is what I've been using for a long time (in some cases w/o
> > > RAID controller and thus HW cache).
> > > If node A spontaneously reboots due to a HW failure or kernel
> > > crash, things will fail over to node B, which is in the best
> > > possible and most up-to-date state at this point.
> > > Data in the HW cache (and the HDD local cache) is potentially
> > > lost.
> > > From the DRBD perspective block X has been successfully written
> > > to nodes A and B, even though it just reached the HW cache of the
> > > RAID controller. So in the worst-case scenario (HW cache
> > > lost/invalidated, HDD caches also lost), we've just lost up to
> > > 4-5GB worth of in-flight data. And unless something changed those
> > > blocks on node B before node A comes back up, they will not be
> > > replicated back.
> > >
> > > Is the above a correct, possible scenario?
>
> As I understand this scenario, you suppose that both your DRBD nodes
> have HW RAID controllers, that node A fails, prompting a fail-over to
> node B, then node B fails immediately? You shouldn't lose anything,
> provided that your HW RAID controller has a BBU and therefore CAN'T
> lose its cache.
>
> Really, use a BBU. Or did I miss something?
>
As I wrote, I've seen cases where the controller lost its marbles in
regards to the cache (or anything else), completely independent of BBU
or power loss.
But that's a special, rare case, obviously.

> > >
> > > As far as read caches are concerned, I'm pretty sure the HW
> > > caches get invalidated in regards to reads when a crash/reboot
> > > happens.
> > >
> > >
> > > 2.
Bcache -> DRBD -> HW cache -> HDD
> > >
> > > With bcache in writeback mode things become interesting in the
> > > Chinese sense.
> > > If node A crashes, not only do we lose all the dirty kernel
> > > buffers (as always), but also everything that was in-flight
> > > within bcache before being flushed to DRBD.
> > > While the bcache documentation states that "Barriers/cache
> > > flushes are handled correctly." and thus hopefully at least the
> > > FS would be in a consistent state, the part that one needs to
> > > detach the bcache device or switch to writethrough mode before
> > > the backing device is clean and consistent confirms the potential
> > > for data loss.
> > >
> > > I could live with bcache in writethrough mode and leaving the
> > > write caching to the HW cache, if losing and re-attaching a
> > > backing device (DRBD) invalidates bcache and prevents it from
> > > delivering stale data. Alas, the bcache documentation is pretty
> > > quiet here; from the looks of it only detaching and re-attaching
> > > would achieve this.
> > >
> Obviously, a bizarre choice.
>
> > >
> > > 3. DRBD -> bcache -> HW cache -> HDD
> > >
> > > The sane and simple approach, as writes will get replicated; no
> > > additional dangers in the write path when compared to 1) above.
> > >
> > > If node A goes down and node B takes over, only previous (recent)
> > > writes will be in the bcache on node B, the cache will be "cold"
> > > otherwise. Once node A comes back the re-sync should hopefully
> > > take care of all stale cache information on the A bcache.
> > >
> Same rule applies here: don't connect your bcache backing device to
> your SATA/SAS ports on the motherboard, use the HW RAID controller
> instead, so that your SSD cache is protected by the BBU. Problem
> solved.
>
The SSDs in question are of course DC level ones, with full power loss
protection.
Keeping things separate also helps tremendously with performance.
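For reference, the detach/re-attach and cache-mode operations discussed
above go through bcache's sysfs interface; a sketch, where bcache0 and
the cache-set UUID are placeholders for the actual setup:

```shell
# Sketch of bcache sysfs administration (placeholders throughout).
# Switch to writethrough before a planned failover:
echo writethrough > /sys/block/bcache0/bcache/cache_mode
# Detach the cache set; dirty data is flushed and the backing device
# is left clean and consistent:
echo 1 > /sys/block/bcache0/bcache/detach
# Re-attach later by writing the cache set's UUID:
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
```

Detaching is what invalidates the cache contents for the backing
device, which is why it is the only documented way to avoid serving
stale data after the scenario described above.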
On the HW I'm testing/implementing this on now, moving the IRQ line for
the onboard AHCI from the default (shared with everything else) CPU 0
to another core (on the same/correct NUMA node, of course) improved 4k
randwrite IOPS from 50k to 70k.
I had already moved the Infiniband and RAID (Areca) controller IRQs off
CPU 0.

Christian

> As the bcache caches the same writes from both nodes, we can safely
> suppose that the bcache states will be similar on both nodes at time
> of crash...
>
> regards,

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Rakuten Communications