[Drbd-dev] [RFC][PATCH] - DRBD PROT_D

Shriram Rajagopalan rshriram at gmail.com
Thu Jan 27 06:38:46 CET 2011

This patch adds a new protocol, Protocol D which operates like
protocol A, with the following differences:
 1. The system can operate in Dual Primary mode with the following properties:
    (a) Writes propagated to the remote machine are "buffered" in
memory (epoch buffers).

    (b) When a checkpoint is issued by the local machine (currently via
an ioctl interface), it is queued up as a P_BARRIER packet (with a
checkpoint number).

    (c) On receiving this special barrier packet, the remote machine
acknowledges receipt of all data buffers via P_CHECKPOINT_ACK and
then falls through to the usual barrier sync code (creates a new epoch
and asynchronously starts flushing the previous epoch buffer to disk).

    (d) The wait_for_checkpoint_ack ioctl returns success to the caller
on receipt of P_CHECKPOINT_ACK. This constitutes a checkpoint commit.

    (e) If the local machine fails, the remote machine discards its
current epoch (the uncommitted checkpoint).

  2. The system can also operate in the usual primary/secondary mode,
in which case the functionality is identical to that of Protocol A,
i.e. there are no checkpoints: plain asynchronous replication.

What is the need for Dual Primary Mode?
  With the ability to operate in both modes (like protocol C), one
could switch checkpointing on/off at any point in time. The same
functionality could be achieved in other ways, but I wanted to use
Prot D with Xen/Live Migration, which requires the ability to operate
a resource in dual primary mode. [Please see the use case at the
end of this email.]

  3. Data Resync Strategy:
  When a node comes back online, it needs to resync two sets of data
from the current primary node:
  (a) Blocks written by the current primary after the previous primary went down
  (b) Blocks written by the previous primary to its local disk during
the last unfinished checkpoint; these have to be overwritten with the
corresponding copies from the current primary node [as it "discarded"
the buffered writes of the last unfinished checkpoint]

  Basically, it is always a one-way copy from the new primary to the
previous primary. To facilitate this, the initial role of a node's
resource is always Secondary [thereby automatically causing a one-way
resync from the current Primary].
One simple use case for this protocol is the Remus HA system,
where a virtual machine is checkpointed (memory and disk) at very
high frequencies (20 times/second). On failover, Remus resumes the
VM on the secondary machine from the last committed checkpoint
(memory and disk).
Remus piggybacks on Xen's Live Migration, which requires DRBD
resources to operate in dual primary mode.

I have been testing this for a couple of weeks with Remus + DRBD as a
full-system HA solution.

The patch needs some clean up work but I would appreciate any feedback on
functionality improvements, concurrency bugs, etc.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd-8.3.10-protD.patch
Type: text/x-patch
Size: 23244 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-dev/attachments/20110126/0649e051/attachment-0001.bin>
