Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Fri, Jun 24, 2011 at 02:27:54PM +0100, Phil Stoneman wrote:
> On 23/06/11 10:49, Phil Stoneman wrote:
> > I need a bit of help understanding an issue we're seeing with our DRBD
> > setup. We have two Debian systems with DRBD 8.3.8. We have a resource
> > configured with protocol A, which backs onto a single SATA disk on each
> > side.
> >
> > I am running the latency test from the DRBD documentation, multiple times:
> >
> >     while true; do dd if=/dev/zero of=$TEST_DEVICE bs=512 count=1000 oflag=direct; sync; done
> >
> > If I stop DRBD and run it against the backing device on each side, it
> > repeats again and again very quickly. If I start DRBD and run it against
> > the DRBD device on the primary side, it runs quickly for about 4 or 5
> > repetitions, then slows right down. Investigation shows pe: climbing
> > in /proc/drbd, and iostat -mxd 1 on the secondary node shows 100% disk
> > usage for the backing device. Note that this repeats in the other
> > direction if I swap primary/secondary roles. It's only the secondary
> > role that's seeing 100% disk usage, not the primary.
> >
> > When I use no-disk-barrier and no-disk-flushes, the problem goes away
> > entirely - but I'm reluctant to enable this permanently, as they're just
> > normal SATA drives without any battery backup or anything, and there are
> > scary warnings about doing that in the documentation :-)
>
> After a bunch more testing, it looks like DRBD on the secondary side
> only is not using (or is regularly flushing) the write cache of the
> underlying storage. It's not doing this on the primary side, and I
> can actually see a reference to this in the manual: "DRBD uses
> disk flushes for write operations both to its replicated data set
> and to its meta data."
>
> I must be honest, I don't completely understand the rationale behind
> utilising the write cache on the primary side but not utilising it
> on the secondary side - it really hurts performance in some use
> cases!
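For reference, in DRBD 8.3 those options live in the disk section of drbd.conf. A minimal sketch (the resource name is made up; only disable barriers/flushes if the write cache is battery-backed or otherwise non-volatile):

    resource r0 {
        disk {
            # Only safe with a non-volatile (e.g. battery-backed) write cache:
            no-disk-barrier;
            no-disk-flushes;
        }
        # ... other resource settings ...
    }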
> Still, now that I know what's going on, I'm a little more
> comfortable using no-disk-barrier and no-disk-flushes. It means that
> I might lose data written into the drive's write cache, but that's
> no worse a situation than a normal system using the hard drives
> natively.
>
> I'm still interested to hear the reason behind why drbd works that
> way though...

A WRITE request arrives on the Primary for some area not yet marked as
"hot" in the activity log.

We write an activity log transaction, which marks this area as "hot",
synchronously, with "BARRIER" or "FLUSH/FUA" if available.

We send the original write over to the peer, and submit the original
write to the local disk.

With protocol A, once the local disk completes the write, we signal
completion to the submitter (upper layers). This completion to upper
layers also closes the current "epoch", or "reorder domain", because
the next request submitted may have been waiting for this one to
complete (write-after-write dependency).

Because we closed that epoch, we send over a "DRBD barrier", to notify
the peer of this reorder-domain boundary. If the peer receives such a
"DRBD barrier" notification, it syncs and flushes its disk, and once
that completes, acknowledges ("DRBD barrier ack") this to the Primary.
This is done for two reasons: first, to enforce consistent
write-after-write dependency behaviour; the second reason is outlined
below.

The next request, potentially again to some not yet "hot" area, again
triggers an activity log transaction. To mark one area hot, we
typically need to mark the least recently used area as "cold" again,
and so on. After recovery from a Primary crash, we resync at least all
areas that have been marked hot in the log, as well as any blocks
explicitly marked as "changed".

So much for context. If we mark things "cold" just because there are
no outstanding completions to upper layers, we may forget to sync
these areas after a Primary crash (after all, the corresponding
requests may not have reached the peer, or the peer's disk, yet).
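The write path above can be sketched as a toy model. All names here are made up for illustration; this is not DRBD's actual implementation, just the bookkeeping it describes: a synchronous activity-log flush when an extent first becomes hot, fire-and-forget replication under protocol A, and one "DRBD barrier" (hence one secondary flush) per closed epoch.

```python
# Toy model of the epoch / "DRBD barrier" bookkeeping described above.
# Illustrative only -- not DRBD's actual code.

class Secondary:
    def __init__(self):
        self.writes = 0    # replicated writes received
        self.flushes = 0   # disk flushes (one per "DRBD barrier")

    def receive_write(self):
        self.writes += 1   # lands in the backing device's (volatile) cache

    def receive_barrier(self):
        # Reorder-domain boundary: sync/flush the disk, then barrier-ack.
        self.flushes += 1


class Primary:
    def __init__(self, peer):
        self.peer = peer
        self.hot_extents = set()   # activity-log areas marked "hot"
        self.al_flushes = 0        # synchronous AL transactions (FLUSH/FUA)

    def write(self, extent):
        if extent not in self.hot_extents:
            self.al_flushes += 1      # synchronous AL transaction, flushed
            self.hot_extents.add(extent)
        self.peer.receive_write()     # protocol A: send, don't wait
        # Local disk completion -> completion to upper layers, which closes
        # the current epoch, so the peer is told about the boundary:
        self.peer.receive_barrier()


# Single-threaded dd: each request completes before the next is issued,
# so every write closes its own epoch -> one secondary flush per write.
sec = Secondary()
pri = Primary(sec)
for _ in range(1000):
    pri.write(extent=0)

print(pri.al_flushes, sec.writes, sec.flushes)   # 1 1000 1000
```

Note how the model reproduces both observations in the thread: the Primary flushes only once (the first activity-log transaction, since all writes hit the same hot extent), while the Secondary flushes once per request, which is exactly why the dd loop hammers the secondary's disk.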
Depending on what DRBD protocol was in use, where requests were at the
time of the crash (still in local buffers?), and a possibly
simultaneous crash of both nodes, on the next resync DRBD would assume
areas to be clean and identical on both nodes which potentially are
not. That may even go unnoticed for quite a while, but cause ill
effects on some subsequent switchover. So we do not mark things cold
until we have received the "DRBD barrier ack" for any corresponding
request that may still have been in flight to the remote disk.

So if you tell DRBD not to flush on such reorder-domain boundaries,
and you have volatile caches involved, you not only lose data that had
been in the volatile caches, you may cause the DRBD replicas to
diverge, silently, after any Primary crash event.

Note that we do not need to flush more often on the Primary, because
we flush on bitmap/activity log updates, there will be no writes in
flight to areas not marked hot in the activity log, and we resync all
those "hot" areas after a Primary crash anyway. Potentially, we could
do fewer flushes on the Secondary as well, but probably at the cost of
additional latency for any request that involves activity log
transactions.

Also note that a single-threaded dd benchmark is the worst possible
load: as you get single requests, and each will complete before the
next one is issued, you get each request in its own "reorder domain",
all separated by a "DRBD barrier", and thus each flushed through the
cache on the secondary individually.

Bottom line: get a good controller, avoid volatile caches. DRBD is not
there to help you get away with crappy hardware, but to help you, in
case things break, despite decent hardware.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed