[Drbd-dev] How Locking in GFS works...

Fri Oct 8 14:32:09 CEST 2004

Hi Friends,

In reallity it is much more complex than we thought in the first
place.

I think that the solution with the "coordinator node" and the write
now packet would be simpler, but it's drawback is the additional
write now packet means that we have more packets on the wirte....

... But please read it first!

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GFS-mode-arbitration2-c.pdf
Type: application/pdf
Size: 10404 bytes
Desc: not available
Url : http://lists.linbit.com/pipermail/drbd-dev/attachments/20041008/1a2c5538/GFS-mode-arbitration2-c.pdf
-------------- next part --------------
9 Support shared disk semantics  ( for GFS, OCFS etc... )

    All the thoughts in this area, imply that the cluster deals
    with split brain situations as discussed in item 6.

  In order to offer a shared disk mode for GFS, we allow both
  nodes to become primary. (This needs to be enabled with the
  config statement net { allow-two-primaries; } )

 Read after write dependencies

  The shared state is available to clusters using protocol C
  and B. It is not usable with protocol A.

  To support the shared state with protocol B, upon a read
  request the node has to check if a new version of the block
  is in the progress of getting written. (== search for it on
  active_ee and done_ee. [ Since it is on active_ee before the 
  RecvAck is sent. ] )

 Global write order

  [ Description of GFS-mode-arbitration2.pdf ]

  1. Basic mirroring with protocol C.
    The file system on N2 issues a write request towards DRBD, 
    which is written to the local disk and sent to N1. Then
    the data bock is written to the local disk here and and
    acknowledge packet is sent back. As soon as both the
    write to the local disk and the ACK from N1 reach N2, 
    DRBD signals the completion of IO to the file system.

    The major pitfall is the handling of concurrent writes to the
    same block. (Concurrent writes to the same blocks should not 
    happen, but we have to assume that it is possible that the
    synchronisation methods of our upper layer [i.e. openGFS] 
    may fail.)

    There are many cases in which such concurrent writes would
    lead to different data on our two copies of the block. 

  2. Concurrent writes, network latency is lower than disk latency
    As we can see on the left side in figure two this could lead
    to N1 has the blue version (=data from FS on N2) while N2
    ends with having the green version (=data from FS on N1).
    The solution is to flag one node (in the example N2 has the
    discard-concurrent-writes-flag).
    As we can see on the right side, now both nodes ends with 
    the blue data.

  3. Concurrent writes, high latency for data packets.
    The problem now is that N2 does can not detect that this was
    a concurrent write, since it got the ACK before the conflicting
    data packets comes in. 
    This can happens since in DRBD, data packets and ACK packets are
    transmitted via two independent TCP connections, therefore the
    ACK packet can overtakes a data packet.
    The solution is to send with the ACK packet a discard info packet,
    which identifies the data packet by it sequence number.
    N2 will keep this discard info as long as it has not seen higher
    sequence numbers by now.
    With this both nodes will end with the blue data.

  4. Concurrent writes, high latency for data packets.
    This is the inverse case to case3 and already handled by the means
    introduced with item 1. 

  5. New write while processing a write from the peer.
    Without further measures this would lead to an inconsistency in 
    our mirror as the figure on the left side shows. 
    If we currently write a conflicting block from the peer, we simply
    discard the write request from our FS and signal IO completion 
    immediately.

  6. High disk latency on N2.
    By IO reordering in the layers below us this could lead to 
    having the blue data on N2 and the green data on N1. 
    The solution to this case is the delay the write to the local
    disk on N2 until the local write is done. This is different from
    case two since we already got the write ACK to the conflicting
    block.

  7. An data packet overtakes an ACK packet on the network.
    Although this case is quite unlikely, we have to take int into 
    account. 

 Proposed solution

  We arbitrary select one node (e.g. the node that did the first
  accept() in the drbd_connect() function) and mark it withe the
  discard-concurrent-writes-flag.

  Each data packet and each ACK packet gets a sequence 
  number, which is increased which every packet sent. 
  (This is a common space of sequence numbers)

  The algorithm which is performed upon the reception of a 
  data packet [drbd_receiver].

  *  If the sequence number of the data packet is higher than
     last_seq+1 sleep until last_seq-1 == seq_num(data packet)

  1. If the packet's sequence number is on the discard list,
     simply drop it.
  2. Do we have a concurrent request? (i.e. Do I have a request
     to the same block in my transfer log.) If not -> write now.
  3. Have I already got an ACK packet for the concurrent 
     request ? (Has the request the RQ_DRBD_SENT bit already set)
     If yes -> write the data from the data packet afterwards.
  4. Do I have the "discard-concurrent-write-flag" ?
     If yes -> discard the data packet.
     If no -> Write data from the data packet afterwards and set
              the RQ_DRBD_SENT bit in the request object ( Since
              will will not get an ACK from our peer )

  The algorithm which is performed upon the reception of an 
  ACK packet [drbd_asender]

  * If we get an ACK, store the sequence number in last_seq.

  The algorithm which is performed upon the reception of an 
  discard info packet [drbd_asender]

  * if the current last_seq is lower the the packet that should
    be discarded, store it in the to discard list.

  BTW, each time we have a concurrent write access, we print
  a warning to the syslog, since this indicates that the layer
  above us is broken!

  Note: In Item 6 we created a hash table over all requests in the
        transfer log, keyed with (sector & ~0x7). This allows us
        to find IO operations starting in the same 4k block of
        data quickly. -> With two lookups the hash table we can
	find any concurrent access.