[Drbd-dev] How Locking in GFS works...
Philipp Reisner
philipp.reisner at linbit.com
Fri Oct 8 14:32:09 CEST 2004
Hi Friends,
In reality it is much more complex than we thought in the first
place.
I think that the solution with the "coordinator node" and the write-now
packet would be simpler, but its drawback is that the additional
write-now packet means that we have more packets on the wire....
... But please read it first!
-Philipp
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GFS-mode-arbitration2-c.pdf
Type: application/pdf
Size: 10404 bytes
Desc: not available
Url : http://lists.linbit.com/pipermail/drbd-dev/attachments/20041008/1a2c5538/GFS-mode-arbitration2-c.pdf
-------------- next part --------------
9 Support shared disk semantics ( for GFS, OCFS etc... )
All the thoughts in this area assume that the cluster deals
with split-brain situations as discussed in item 6.
In order to offer a shared disk mode for GFS, we allow both
nodes to become primary. (This needs to be enabled with the
config statement net { allow-two-primaries; } )
Read after write dependencies
The shared state is available to clusters using protocols B
and C. It is not usable with protocol A.
To support the shared state with protocol B, upon a read
request the node has to check whether a new version of the block
is currently in the process of being written. (== search for it on
active_ee and done_ee. [ Since it is on active_ee before the
RecvAck is sent. ] )
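A minimal userspace sketch of such an overlap check, assuming a plain
singly linked list of in-flight peer writes (the names pending_write and
read_conflicts_with_pending_write are invented for the illustration; they
are not the actual DRBD structures built around active_ee/done_ee):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t sector_t;

    /* Stand-in for an epoch entry: one write received from the peer
     * that has not yet completed on the local disk. */
    struct pending_write {
        sector_t sector;              /* start sector of the write */
        unsigned int size;            /* length in bytes           */
        struct pending_write *next;
    };

    /* Return true if a read of `size` bytes at `sector` overlaps a write
     * that is still in flight (what DRBD would find on active_ee or
     * done_ee); the caller would then defer the read until that write
     * has reached the local disk. */
    bool read_conflicts_with_pending_write(const struct pending_write *list,
                                           sector_t sector, unsigned int size)
    {
        sector_t r_end = sector + (size >> 9);    /* 512-byte sectors */

        for (const struct pending_write *w = list; w; w = w->next) {
            sector_t w_end = w->sector + (w->size >> 9);
            if (sector < w_end && w->sector < r_end)
                return true;                      /* ranges overlap */
        }
        return false;
    }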
Global write order
[ Description of GFS-mode-arbitration2.pdf ]
1. Basic mirroring with protocol C.
The file system on N2 issues a write request towards DRBD,
which is written to the local disk and sent to N1. There
the data block is written to N1's local disk and an
acknowledgement packet is sent back. As soon as both the
write to the local disk and the ACK from N1 have reached N2,
DRBD signals the completion of IO to the file system.
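As a rough illustration of that completion rule (not DRBD's actual request
handling; the flag names and functions below are made up for the sketch):

    #include <stdio.h>

    /* Illustrative completion flags for one mirrored write request. */
    enum {
        REQ_LOCAL_DONE = 1 << 0,   /* local disk write finished      */
        REQ_PEER_ACKED = 1 << 1,   /* ACK from the peer (N1) arrived */
    };

    struct write_req {
        unsigned int flags;
    };

    /* Completion is signalled to the file system only once both events
     * have happened, in whichever order they arrive. */
    static void maybe_complete(struct write_req *req)
    {
        if ((req->flags & (REQ_LOCAL_DONE | REQ_PEER_ACKED)) ==
            (REQ_LOCAL_DONE | REQ_PEER_ACKED))
            printf("signal IO completion to the file system\n");
    }

    /* Called from the local-disk completion path. */
    void local_write_finished(struct write_req *req)
    {
        req->flags |= REQ_LOCAL_DONE;
        maybe_complete(req);
    }

    /* Called when the ACK packet from N1 is received. */
    void peer_ack_received(struct write_req *req)
    {
        req->flags |= REQ_PEER_ACKED;
        maybe_complete(req);
    }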
The major pitfall is the handling of concurrent writes to the
same block. (Concurrent writes to the same blocks should not
happen, but we have to assume that it is possible that the
synchronisation methods of our upper layer [i.e. openGFS]
may fail.)
There are many cases in which such concurrent writes would
lead to different data on our two copies of the block.
2. Concurrent writes, network latency is lower than disk latency
As we can see on the left side of figure two, this could lead
to N1 ending up with the blue version (= data from the FS on N2) while
N2 ends up with the green version (= data from the FS on N1).
The solution is to flag one node (in the example N2 has the
discard-concurrent-writes-flag).
As we can see on the right side, both nodes now end up with
the blue data.
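A tiny self-contained sketch of what the flag buys us: both nodes apply
the same rule, and because exactly one of them carries the flag they
always reach opposite decisions for the same pair of conflicting writes
(illustrative only; the real decision also depends on the ACK state and
sequence numbers, as the following cases show):

    #include <stdbool.h>
    #include <stdio.h>

    /* The node with the discard-concurrent-writes-flag keeps its own
     * data and drops the peer's packet; the other node lets the peer's
     * data win.  Both copies of the block therefore end up identical. */
    static const char *resolve_concurrent_write(bool i_have_the_flag)
    {
        return i_have_the_flag ? "keep own data, discard the peer's packet"
                               : "let the peer's data overwrite our own";
    }

    int main(void)
    {
        printf("N1 (no flag):   %s\n", resolve_concurrent_write(false));
        printf("N2 (with flag): %s\n", resolve_concurrent_write(true));
        return 0;
    }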
3. Concurrent writes, high latency for data packets.
The problem now is that N2 cannot detect that this was
a concurrent write, since it got the ACK before the conflicting
data packet comes in.
This can happen since in DRBD, data packets and ACK packets are
transmitted via two independent TCP connections, therefore an
ACK packet can overtake a data packet.
The solution is to send, along with the ACK packet, a discard info packet,
which identifies the data packet by its sequence number.
N2 will keep this discard info as long as it has not yet seen higher
sequence numbers.
With this both nodes will end up with the blue data.
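Purely illustrative on-wire layouts for the packets involved - these are
not DRBD's actual packet formats, just a sketch of which fields the
mechanism needs:

    #include <stdint.h>

    /* Data packets and ACK packets share one sequence-number space, and
     * the discard info names a data packet purely by that number. */
    struct p_data_hdr {
        uint32_t seq_num;      /* sequence number of this data packet   */
        uint64_t sector;       /* where the block is to be written      */
        uint32_t size;         /* payload length following the header   */
    };

    struct p_ack {
        uint32_t seq_num;      /* sequence number of this ACK packet    */
        uint64_t sector;       /* block being acknowledged              */
    };

    struct p_discard_info {
        uint32_t discard_seq;  /* seq number of the data packet that the
                                  receiver must drop when it arrives    */
    };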
4. Concurrent writes, high latency for data packets.
This is the inverse case of case 3 and is already handled by the means
introduced with item 1.
5. New write while processing a write from the peer.
Without further measures this would lead to an inconsistency in
our mirror, as the figure on the left side shows.
If we are currently writing a conflicting block from the peer, we simply
discard the write request from our FS and signal IO completion
immediately.
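A compilable toy sketch of that rule (one global flag instead of a
per-block lookup, and printf instead of real IO):

    #include <stdbool.h>
    #include <stdio.h>

    /* Set while a block received from the peer is being written to the
     * local disk (per block in reality; one flag is enough for the toy). */
    static bool peer_write_in_progress;

    /* Write request arriving from our own file system: if it conflicts
     * with a peer write currently in flight, it is dropped and reported
     * as complete immediately; otherwise it takes the normal path. */
    static void fs_write(long sector)
    {
        if (peer_write_in_progress) {
            printf("sector %ld: discarded, IO completion signalled at once\n",
                   sector);
            return;
        }
        printf("sector %ld: write locally and send to the peer\n", sector);
    }

    int main(void)
    {
        fs_write(8);                    /* normal path                     */
        peer_write_in_progress = true;  /* a conflicting peer write starts */
        fs_write(8);                    /* so our own write is dropped     */
        return 0;
    }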
6. High disk latency on N2.
By IO reordering in the layers below us this could lead to
having the blue data on N2 and the green data on N1.
The solution to this case is to delay the write of the peer's data to
the local disk on N2 until N2's own local write is done. This is
different from case two, since we have already got the write ACK for
the conflicting block.
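A sketch of that deferral, assuming at most one conflicting peer write
per local request (the structure and function names are invented for the
illustration):

    #include <stddef.h>

    struct peer_data {
        const void *data;
        unsigned int size;
    };

    /* One of our own in-flight local writes; at most one conflicting
     * peer write can be parked behind it in this simplified sketch. */
    struct local_write {
        int local_disk_done;
        struct peer_data *deferred;
    };

    /* Stand-in for the low-level disk submit. */
    static void submit_peer_data_to_disk(struct peer_data *pd) { (void)pd; }

    /* The peer's conflicting data arrives: write it only if our own write
     * has already hit the disk, otherwise park it behind the local write. */
    void receive_conflicting_peer_data(struct local_write *lw,
                                       struct peer_data *pd)
    {
        if (lw->local_disk_done)
            submit_peer_data_to_disk(pd);
        else
            lw->deferred = pd;
    }

    /* Completion path of our own local write: now the peer's version may
     * safely overwrite ours, so it always ends up on disk last. */
    void local_write_completed(struct local_write *lw)
    {
        lw->local_disk_done = 1;
        if (lw->deferred) {
            submit_peer_data_to_disk(lw->deferred);
            lw->deferred = NULL;
        }
    }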
7. A data packet overtakes an ACK packet on the network.
Although this case is quite unlikely, we have to take it into
account.
Proposed solution
We arbitrarily select one node (e.g. the node that did the first
accept() in the drbd_connect() function) and mark it with the
discard-concurrent-writes-flag.
Each data packet and each ACK packet gets a sequence
number, which is increased with every packet sent.
(This is a common space of sequence numbers.)
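A small sketch of that per-connection state - the names and the
won_accept_race parameter are stand-ins, not the actual drbd_connect()
code:

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-connection arbitration state (names invented for the sketch). */
    struct conn_state {
        bool discard_concurrent_writes;  /* exactly one node has this set  */
        uint32_t send_seq;               /* common space for data and ACKs */
    };

    /* Called once the connection is up; won_accept_race stands for "this
     * node's listen socket did the first accept()" - any deterministic
     * tie-breaker works as long as exactly one side ends up flagged. */
    void init_arbitration(struct conn_state *c, bool won_accept_race)
    {
        c->discard_concurrent_writes = won_accept_race;
        c->send_seq = 0;
    }

    /* Every packet sent - data packet or ACK alike - takes the next
     * number from the shared counter, so any two packets can be ordered. */
    uint32_t next_seq(struct conn_state *c)
    {
        return ++c->send_seq;
    }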
The algorithm which is performed upon the reception of a
data packet [drbd_receiver].
* If the sequence number of the data packet is higher than
last_seq+1, sleep until last_seq+1 == seq_num(data packet)
1. If the packet's sequence number is on the discard list,
simply drop it.
2. Do we have a concurrent request? (i.e. Do I have a request
to the same block in my transfer log.) If not -> write now.
3. Have I already got an ACK packet for the concurrent
request? (Does the request already have the RQ_DRBD_SENT bit set?)
If yes -> write the data from the data packet afterwards.
4. Do I have the "discard-concurrent-writes-flag"?
If yes -> discard the data packet.
If no -> write the data from the data packet afterwards and set
the RQ_DRBD_SENT bit in the request object (since we
will not get an ACK from our peer)
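A condensed sketch of this decision chain as a pure function (the context
flags stand for the lookups described above; the real code additionally
sleeps on the sequence number first and marks the local request as sent
in the "no flag" branch):

    #include <stdbool.h>

    /* What the receiver does with an incoming data packet. */
    enum recv_action {
        DROP_PACKET,        /* step 1: seq number was announced for discard */
        WRITE_NOW,          /* step 2: no concurrent local request          */
        WRITE_AFTER_LOCAL,  /* step 3/4: queue behind our conflicting write */
        DISCARD_CONCURRENT, /* step 4: we hold the flag, drop the data      */
    };

    struct recv_ctx {
        bool on_discard_list;       /* seq found on the to-discard list    */
        bool have_concurrent_req;   /* conflicting request in transfer log */
        bool concurrent_req_acked;  /* RQ_DRBD_SENT already set            */
        bool have_discard_flag;     /* discard-concurrent-writes-flag      */
    };

    enum recv_action decide_data_packet(const struct recv_ctx *c)
    {
        if (c->on_discard_list)
            return DROP_PACKET;           /* step 1 */
        if (!c->have_concurrent_req)
            return WRITE_NOW;             /* step 2 */
        if (c->concurrent_req_acked)
            return WRITE_AFTER_LOCAL;     /* step 3 */
        if (c->have_discard_flag)
            return DISCARD_CONCURRENT;    /* step 4, yes */
        return WRITE_AFTER_LOCAL;         /* step 4, no  */
    }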
The algorithm which is performed upon the reception of an
ACK packet [drbd_asender]
* If we get an ACK, store the sequence number in last_seq.
The algorithm which is performed upon the reception of a
discard info packet [drbd_asender]
* If the current last_seq is lower than the sequence number of the
packet that should be discarded, store it in the to-discard list.
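A toy version of the two asender cases plus the matching receiver-side
check, using a small fixed-size array instead of a real list
(illustrative only):

    #include <stdbool.h>
    #include <stdint.h>

    #define DISCARD_SLOTS 8              /* arbitrary size for the sketch */

    static uint32_t last_seq;            /* highest ACK sequence seen     */
    static uint32_t discard_list[DISCARD_SLOTS];
    static unsigned int discard_count;

    /* ACK packet received [drbd_asender]: remember the newest sequence. */
    void on_ack(uint32_t seq)
    {
        if (seq > last_seq)
            last_seq = seq;
    }

    /* Discard info received [drbd_asender]: only keep it if the data
     * packet it names is still ahead of everything acknowledged so far. */
    void on_discard_info(uint32_t discard_seq)
    {
        if (discard_seq > last_seq && discard_count < DISCARD_SLOTS)
            discard_list[discard_count++] = discard_seq;
    }

    /* Step 1 of the data packet algorithm [drbd_receiver]. */
    bool seq_on_discard_list(uint32_t seq)
    {
        for (unsigned int i = 0; i < discard_count; i++)
            if (discard_list[i] == seq)
                return true;
        return false;
    }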
BTW, each time we have a concurrent write access, we print
a warning to the syslog, since this indicates that the layer
above us is broken!
Note: In Item 6 we created a hash table over all requests in the
transfer log, keyed with (sector & ~0x7). This allows us
to find IO operations starting in the same 4k block of
data quickly. -> With two lookups in the hash table we can
find any concurrent access.
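A sketch of that lookup, assuming 512-byte sectors and requests of at
most 4k (bucket count and structure names are arbitrary for the
illustration):

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t sector_t;

    #define HASH_BUCKETS 61              /* arbitrary size for the sketch */

    struct tl_request {
        sector_t sector;                 /* start sector of the request   */
        unsigned int size;               /* length in bytes (at most 4k)  */
        struct tl_request *hash_next;    /* chain within one bucket       */
    };

    static struct tl_request *buckets[HASH_BUCKETS];

    /* Key: the request's start sector rounded down to its 4k block
     * (8 sectors of 512 bytes). */
    static unsigned int hash_key(sector_t sector)
    {
        return (unsigned int)((sector & ~(sector_t)0x7) % HASH_BUCKETS);
    }

    /* Find a transfer-log request overlapping the 4k block that `sector`
     * lies in.  With requests of at most 4k, any overlapping request must
     * start either in this block or in the one directly before it, hence
     * the two lookups. */
    struct tl_request *find_concurrent(sector_t sector)
    {
        sector_t blk = sector & ~(sector_t)0x7;
        sector_t keys[2] = { blk, blk >= 8 ? blk - 8 : blk };

        for (int i = 0; i < 2; i++) {
            for (struct tl_request *r = buckets[hash_key(keys[i])];
                 r != NULL; r = r->hash_next) {
                sector_t r_end = r->sector + (r->size >> 9);
                if (r->sector < blk + 8 && blk < r_end)
                    return r;            /* overlapping request found */
            }
        }
        return NULL;
    }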