[Drbd-dev] SPDK + DRBD + tcmu-runner storage handlers

David Butterfield dab21774 at gmail.com
Thu Sep 19 19:11:19 CEST 2019

(Resending without the diagram attachment that is too large for the mailing list)
(Refer to diagram at https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/spdk_drbd.pdf)

tcmu-runner block storage handlers running under SPDK
A prototype of a new block device module "bdev_tcmur" running under the Storage Performance
Development Kit allows access to block storage using tcmu-runner handlers.  (tcmu-runner itself
is not involved; only its loadable handlers are used here.)  The bdev_tcmur module is based on
the bdev_aio module source.  It enables the pathways for LUN 2 and LUN 3 shown in the diagram.

Distributed Replicated Block Device (DRBD 9.0) running in usermode
A recent project ported DRBD from the kernel to run in usermode as a Linux process, using
support from emulated kernel functions and a multi-threaded engine based on epoll_wait().  The
DRBD source code itself is unmodified, with its expected environment simulated around it.  It
receives requests from clients through the kernel's block-I/O ("bio") protocol, and also makes
requests to its backing storage using that same protocol.  Usermode DRBD can be plumbed under
Usermode SCST (not shown in this diagram), or under a FUSE interface (drbd1 in the diagram).

DRBD running with SPDK
To bring usermode DRBD into an SPDK process, a new SPDK bdev module "bdev_bio" implements
translation of SPDK block device requests into the kernel's block-I/O ("bio") protocol, as
expected by DRBD.  This enables the pathways for LUN 4 and LUN 5 shown in the diagram.

DRBD then makes bio requests to its backing storage, which at present must be a tcmu-runner
device.  Supporting arbitrary SPDK devices (e.g. using Malloc0 to back a DRBD device) requires
a "bio_bdev" module to translate bio requests into SPDK bdev protocol.  (TBD)

The SPDK configuration file plus an external helper provide enough for SPDK to configure DRBD
with the devices needed by SPDK.  Once the SPDK+DRBD server is up and running, the DRBD logic
can be controlled using the native DRBD management commands (drbdsetup and drbdadm).

The emulated kernel functions (UMC - usermode compatibility) make use of services provided by a
multithreaded event engine (MTE) implemented around epoll_wait().  The MTE services are accessed
by UMC through an ops vector backed by MTE services for memory, time, and threads, as well as
event polling of file descriptors, timers, and a FIFO of work to be done ASAP.  I anticipate an
easy time converting the ops vector to point at a shim to SPDK services in place of MTE calls.

The implementation is very new.  So far I have mainly tested it using the SPDK iSCSI server,
exporting tcmu-runner backend devices as SCSI LUNs.  That seems to work reliably.  The drbd and
tcmur devices can alternatively be mounted locally through the FUSE interface, which also works.

I have only tried it with one reactor core.

This prototype implementation is clearly in need of some cleaning up and interfaces straightened
out.  I've been studying SPDK for less than two weeks, and I guessed at a few things that I need
to go back over carefully.  But it runs.

The makefiles have optimizations turned off and debugs turned on.

The UMC FUSE implementation is single-threaded and synchronous; thus it operates at an effective
queue depth of one.  This matters most when using it to access replicated volumes with DRBD
Protocol C, where performance will suffer significantly.  Accessing volumes with Protocol A
configured to "pull-ahead" performs reasonably, as does accessing the same data through an iSCSI
LUN, which does not have the QD=1 limitation.
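
A rough bound shows why a queue depth of one hurts most with Protocol C: with QD requests in
flight and a per-request completion latency L, throughput cannot exceed QD / L.  Protocol C
completes a write only after the peer has confirmed it, so L includes a network round trip.  The
sketch below only does that arithmetic; est_iops is a hypothetical helper, and the 500
microsecond latency used in the usage lines is an illustrative assumption, not a measurement.

```shell
# est_iops QUEUE_DEPTH LATENCY_US
# Upper bound on I/Os per second when QUEUE_DEPTH requests are in flight
# and each one takes LATENCY_US microseconds to complete.
# (Hypothetical helper for illustration; integer arithmetic only.)
est_iops() {
    echo $(( $1 * 1000000 / $2 ))
}
```

    est_iops 1 500      # prints 2000
    est_iops 32 500     # prints 64000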

NOTE: Only tcmu-runner modules handler_ram.so and handler_file.so have been tried so far; the
      latter is significantly faster, so it is the one specified in the example configuration
      files.  An *async* tcmu-runner handler (nr_threads == 0) has yet to be tried!

Usermode DRBD Limitations
Netlink multicast emulation is not yet implemented, so anything like "drbdsetup wait*" hangs.

The bio block device nodes are exposed through a mount of the server's UMC fuse filesystem
implementation.  The fuse-tree node that represents a DRBD or TCMUR block device appears as a
regular file rather than as a block device (because otherwise fuse directs I/O for that dev_t to
the kernel instead of the fuse filesystem server).  So when communicating with a usermode
server, the DRBD utilities are modified to omit the check that their device is of type S_IFBLK
rather than S_IFREG.
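
A quick way to see how a node is presented is to test its file type from the shell.  This is a
minimal sketch; node_kind is a hypothetical helper name, not part of the DRBD utilities.  Per
the paragraph above, a DRBD or TCMUR node under the fuse mount would be expected to report
"regular", where a kernel device node would report "block".

```shell
# node_kind PATH
# Print how PATH is presented by the filesystem:
# block, regular, other, or missing.
node_kind() {
    if [ -b "$1" ]; then
        echo block
    elif [ -f "$1" ]; then
        echo regular
    elif [ -e "$1" ]; then
        echo other
    else
        echo missing
    fi
}
```

    node_kind /UMCfuse/dev/drbd1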

Messages from the utilities and in the logs have not been modified, so they will still refer to
"the kernel" etc. when referring to code that has been ported from the kernel to usermode.

Resync may run noticeably slower when observing resync network traffic with tcpdump.

Something I expect NOT to work is running the server executable off of a disk it implements.

I have only run the usermode server on machines without DRBD installed in the kernel.  The build
script and the config/run instructions assume that there are no DRBD modules or utilities
installed.  (That would likely be very confusing, but might actually work if assigned separate

In general only the "happy path" has received any exercise -- expect bugs in untested error-
handling logic.

"Exclusive" opens aren't really exclusive, so be careful not to mount the same storage twice;
for example /UMCfuse/dev/file_c and /UMCfuse/dev/drbd1 are the same storage in the example
configuration.  For another example, SPDK configuration [BIO] for bdev_bio should never consume
both drbd2 and ram_b concurrently.  "Holders" and "claims" are not yet implemented.

The "writable" bits in the mode permissions do not appear correctly in /UMCfuse/dev.

The server apparently can mount and write a replicated DRBD device on a secondary node.

fsync/flush is probably ineffective.

4096 is the only tested block size; possible bugs with others.

Stacktrace is broken.

Probably there are broken untested refcountings on things that usually only get opened once.
(E.g. two concurrent dd commands to the same device or things like that).

Clean shutdown does not work at all.

I always "make clean" before make, because my makefiles don't calculate dependencies right.  The
makefiles are hateworthy.  SCST repository is unnecessarily tangled up with the build.

Sometimes DRBD resync doesn't start upon reconnect after restarting the server.  If it doesn't
start, disconnecting + reconnecting to the peer usually gets it going.
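
The disconnect + reconnect workaround can be wrapped in a small function.  This is a sketch
only: reconnect_resource and the DRBDADM variable are hypothetical names, and the default
assumes the utilities are run through "sudo -E" with $UMC_FS_ROOT already exported.

```shell
# Hypothetical knob: the command used to invoke drbdadm.
DRBDADM=${DRBDADM:-sudo -E drbdadm}

# reconnect_resource RESOURCE
# Disconnect RESOURCE from its peer, then connect again; the connect
# step runs only if the disconnect succeeded.
reconnect_resource() {
    $DRBDADM disconnect "$1" && $DRBDADM connect "$1"
}
```

    reconnect_resource nonspdk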

I have seen a very weird problem using the tcmu-runner handler_file.so.  After dlopen(),
libtcmur.c looks up the symbol for the handler_init routine and calls it.  The handler calls
back with the address of its ops vector.  The function addresses in the ops vector are properly
relocated for the loaded module, and the main module calls functions through the ops vector
thousands of times... and then suddenly SIGSEGV, and examining the ops vector (under gdb) the
function addresses are all back to their original UNRELOCATED relative values!  (And the
faulting program counter address matches the unrelocated value in the member of the ops vector
it was trying to call through.)  I have never seen this happen with handler_ram.

However, I have not seen the problem since I ensured adequate memory for the SPDK server.  The
SPDK test machine has "only" 4GiB RAM, and swap space used was increasing during problem tests.
Because handler_file runs significantly faster than handler_ram for mounted filesystems, all the
tcmu-runner handler devices in the example are now by default configured to use handler_file
(despite some names in /UMCfuse/dev and /tmp continuing to be called "ram" rather than "file").

Building from Source Code
The source code to build SPDK with support for tcmu-runner handlers is in my forks of the SPDK
and tcmu-runner repositories.  Building in DRBD support requires several additional repositories.
Because building is presently a mess, I've included scripts that will download the repositories
and build SPDK with support for tcmu-runner loadable handlers and/or DRBD.  To download and
build the SPDK iSCSI server with support for BOTH, cd into an empty directory and do:

    wget https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/BUILD_spdk_drbd.sh
    chmod 755 BUILD_spdk_drbd.sh

To OMIT DRBD and only download/build SPDK with support for tcmu-runner handlers do this instead:

    wget https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/BUILD_spdk_tcmur.sh
    chmod 755 BUILD_spdk_tcmur.sh

The (former) DRBD script downloads and builds a superset of what the (latter) TCMUR script does,
and after the DRBD download you can specify to build the more limited server (to support TCMUR
but not DRBD) by selection of configuration options:

    --with-tcmur			# SPDK with tcmu-runner only
    --with-tcmur --with-drbd		# SPDK with DRBD and tcmu-runner

Comments in the download/build scripts document the process in case you want to do some steps
manually.  (The scripts ask for the sudo password to install, so you might want to look at them
first.)

The SCRIPTS ASSUME you already have the tools and libraries installed such that you can build
the standard SPDK, DRBD, and tcmu-runner repositories.  Some of the makefiles require various
build tools -- here are package names I added to a fresh installation of Ubuntu 18.04 LTS to
complete the build:

    build-essential  g++  gcc  git  make  gdb  valgrind  cscope  exuberant-ctags
    libfuse-dev  libaio-dev  libglib2.0-dev  libkmod-dev  libnl-3-dev  libnl-genl-3-dev
    librbd-dev  autoconf  automake  flex  coccinelle  cmake

I always "make clean" before "make", because my makefiles don't calculate dependencies right.

There should be no compile errors, but there will be some warnings in the DRBD code.  The build
script documents a few that are expected and can be ignored for now.

The example config files in etc/drbd.d are from a node in my setup.  They will have to be
modified to suit your network configuration, and put into /etc/drbd.d on your test system.

There is also a nasty "helper" script /usr/sbin/drbdadm_up_primary which at present can only
bring up one specific SPDK/DRBD device in the example configuration.  To support a different
configuration, that file probably needs updating (in addition to /etc/drbd.d/* and the SPDK
configuration file).

To run the DRBD management utilities so that they refer to the simulated /proc that talks to the
usermode server process (rather than the real /proc that talks to the kernel):

    export UMC_FS_ROOT=/UMCfuse			    # *** SET ENVIRONMENT ***

The utilities need the $UMC_FS_ROOT environment variable set to control the usermode DRBD server
instead of a kernel-based server.  But they also need to run superuser.  Keep in mind that the
sudo program does not pass your shell environment through to the program given on its command
line, unless you specify "sudo -E".  (Omitting the "-E" leads to bewildering non-sequitur error
messages because the utility is trying to parse an earlier version of the command language.)

Also the *server* needs the $UMC_FS_ROOT environment variable set, because it invokes the
utilities through a "usermode helper", and they inherit the variable from the server.
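
Since a missing variable fails in confusing ways, it can help to check it before starting the
server or a utility.  A minimal sketch follows; require_umc_root is a hypothetical helper name.

```shell
# require_umc_root
# Succeed only if UMC_FS_ROOT is set and names an existing directory;
# otherwise explain the problem on stderr and return non-zero.
require_umc_root() {
    if [ -z "${UMC_FS_ROOT:-}" ]; then
        echo "UMC_FS_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$UMC_FS_ROOT" ]; then
        echo "UMC_FS_ROOT=$UMC_FS_ROOT is not a directory" >&2
        return 1
    fi
    return 0
}
```

    require_umc_root && sudo -E drbdadm ...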

The download/build script ends with a suggested server command line that depends on which
script you used.  The two scripts refer to different configuration files depending on whether
DRBD support was selected or not.

If you didn't read the sections "Configuring" and "Running" just above, read those.

The implementation and configuration of SPDK+DRBD is an order of magnitude more complex than the
relatively straightforward implementation of tcmu-runner handlers under SPDK.  You may wish to
make sure the simpler case works before bringing in DRBD.

Make sure your configuration files were suitably modified for your names, addresses, etc.

Make sure you are running the server and the utilities with the environment variable set:
    export UMC_FS_ROOT=/UMCfuse
    sudo -E drbdadm ...		# -E to pass the environment variable through sudo

Missing the environment variable leads to bewildering non-sequitur error messages because the
utility is trying to parse an earlier version of the command language.  These messages in the
server log or output from a DRBD utility probably mean the environment variable is not set:
    Cannot determine minor device number of device
    Missing connection endpoint argument
    Parse error: 'disk | device | address | meta-disk | flexible-meta-disk' expected,
	    but got 'node-id'
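
The messages listed above can be searched for mechanically.  The filter below is a sketch
(umc_env_symptoms is a hypothetical name): it reads text on stdin, prints any lines containing
one of those messages, and returns success only if at least one was found.

```shell
# umc_env_symptoms
# Print lines from stdin that contain one of the messages that usually
# mean UMC_FS_ROOT was not set; exit status 0 only if any line matched.
umc_env_symptoms() {
    grep -F \
        -e 'Cannot determine minor device number of device' \
        -e 'Missing connection endpoint argument' \
        -e "Parse error: 'disk | device | address | meta-disk | flexible-meta-disk' expected"
}
```

    sudo -E drbdadm ... 2>&1 | umc_env_symptoms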

/proc and /sys/module entries for the DRBD usermode server can be observed under /UMCfuse.

After starting the server, a node should appear in /UMCfuse/dev for each bio or tcmu-runner
device configured by SPDK.  DRBD resource "nonspdk" (drbd1) is not configured as an SPDK device.
After the server is up the resource may be enabled using the native DRBD command, after which
its node should appear under /UMCfuse/dev:

    drbdadm up nonspdk	    # assumes metadata previously created
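
Because "drbdsetup wait*" hangs in the usermode server, a script that needs the node can poll
for it instead.  This is a minimal sketch; wait_for_node is a hypothetical helper, and the
10-second default timeout is an arbitrary assumption.

```shell
# wait_for_node PATH [TIMEOUT_SECONDS]
# Poll once per second until PATH exists; return 0 when it appears,
# or 1 if it has not appeared after TIMEOUT_SECONDS (default 10).
wait_for_node() {
    node=$1
    remaining=${2:-10}
    while [ ! -e "$node" ]; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

    sudo -E drbdadm up nonspdk && wait_for_node /UMCfuse/dev/drbd1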

Multiple names can refer to the same underlying storage.  Referring to the diagram, LUN 5, bio1,
/UMCfuse/dev/drbd2, and /UMCfuse/dev/ram_b all refer to the same underlying storage in
/tmp/tcmur_ram01.  A filesystem can be mounted on an iSCSI initiator as LUN 5, or the same
filesystem can be mounted locally, e.g.

    sudo mount /UMCfuse/dev/drbd2 /mnt/x

One bug is that exclusive open is not currently exclusive, so be careful not to use storage
multiple ways at the same time!

More Information
The DRBD kernel source code ported to usermode is (within a dozen lines of) unmodified from the
original code in the LINBIT repository, with its expected kernel environment simulated around
it.  For more information about how that was done, see the README.md with diagrams at

David Butterfield						Tue 17 Sep 2019 09:43:35 PM MDT
