[DRBD-cvs] r1582 - trunk

svn at svn.drbd.org
Tue Oct 5 21:38:09 CEST 2004


Author: phil
Date: 2004-10-05 21:38:06 +0200 (Tue, 05 Oct 2004)
New Revision: 1582

Modified:
   trunk/ROADMAP
Log:
Updates to item 9 (GFS mode)


Modified: trunk/ROADMAP
===================================================================
--- trunk/ROADMAP	2004-10-05 17:51:18 UTC (rev 1581)
+++ trunk/ROADMAP	2004-10-05 19:38:06 UTC (rev 1582)
@@ -150,63 +150,62 @@
     All the thoughts in this area imply that the cluster deals
     with split-brain situations as discussed in item 6.
 
-  In order to offer a shared disk mode for GFS, we introduce a 
-  new state "shared" (in addition to primary and secondary).
+  In order to offer a shared disk mode for GFS, we allow both
+  nodes to become primary. (This needs to be enabled with the
+  config statement net { allow-two-primaries; } )
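+
+  ( For illustration, a minimal drbd.conf fragment showing how
+    the proposed statement might be used; the resource name
+    "r0", the host names and the device paths are placeholders,
+    not part of this proposal:
+
+      resource r0 {
+        protocol C;
+        net {
+          allow-two-primaries;
+        }
+        on alpha {
+          device    /dev/drbd0;
+          disk      /dev/sda7;
+          address   10.1.1.1:7788;
+          meta-disk internal;
+        }
+        on bravo {
+          device    /dev/drbd0;
+          disk      /dev/sda7;
+          address   10.1.1.2:7788;
+          meta-disk internal;
+        }
+      }
+  )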
 
-  In a cluster of two nodes in shared state we determine a 
-  coordinator node (e.g. by selecting the node with the 
-  numeric higher IP address)
+ Read after write dependencies
 
- read after write dependencies
-
   The shared state is available to clusters using protocols C
   and B. It is not usable with protocol A.
 
   To support the shared state with protocol B, upon a read
   request the node has to check if a new version of the block
   is in the process of being written. (== search for it on
-  active_ee and done_ee, must make sure that it is on active_ee
-  before the RecvAck is sent. [is already the case.] )
+  active_ee and done_ee. [ This suffices, since it is on
+  active_ee before the RecvAck is sent. ] )
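+
+  ( A sketch of that check in kernel-style C; the entry struct,
+    the list member and the lock name are invented for
+    illustration and need not match the real code:
+
+      /* Return 1 if a write to 'sector' is still in flight,
+       * i.e. queued on active_ee or done_ee. */
+      static int write_in_flight(struct drbd_dev *mdev,
+                                 sector_t sector)
+      {
+              struct epoch_entry *e;
+              int found = 0;
+
+              spin_lock_irq(&mdev->ee_lock);
+              list_for_each_entry(e, &mdev->active_ee, list)
+                      if (e->sector == sector)
+                              found = 1;
+              list_for_each_entry(e, &mdev->done_ee, list)
+                      if (e->sector == sector)
+                              found = 1;
+              spin_unlock_irq(&mdev->ee_lock);
+              return found;
+      }
+  )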
   
- global write order
+ Global write order
 
-  As far as I understand the topic up to now we have two options
-  to establish a global write order. 
+  The major pitfall is the handling of concurrent writes to the
+  same block. (Concurrent writes to the same block should not
+  happen, but we have to assume that the synchronisation
+  methods of our upper layer [i.e. openGFS] may fail.)
 
-  Proposed Solution 1, using the order of a coordinator node:
+  Without further handling, concurrent writes to the same block
+  would get written on each node locally first, then sent to
+  the peer, where they would overwrite the locally written
+  version. In other words, each node would end up with the
+  peer's version of the data, leaving the two nodes inconsistent.
 
-  Writes from the coordinator node are carried out, as they are
-  carried out on the primary node in conventional DRBD. ( Write 
-  to disk and send to peer simultaneously. )
+  Both nodes need to agree on _one_ order in which such
+  conflicting writes are carried out.
 
-  Writes from the other node are sent to the coordinator first, 
-  then the coordinator inserts a small "write now" packet into
-  its stream of write packets.
-  The node commits the write to its local IO subsystem as soon 
-  as it gets the "write-now" packet from the coordinator.
+ Proposed Solution
 
-  Note: With protocol C it does not matter which node is the
-        coordinator from the performance viewpoint.
+  We arbitrarily select one node (e.g. the node that did the
+  first accept() in the drbd_connect() function) and mark it
+  with the discard-concurrent-write-flag.
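+
+  ( Sketch with invented names: during the handshake the node
+    that accept()ed first could record the flag like this:
+
+      if (we_did_the_first_accept)
+              set_bit(DISCARD_CONCURRENT, &mdev->flags);
+  )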
 
-  Proposed Solution 2, use a dedicated LRU to implement locking:
+  The following algorithm is performed upon reception of a
+  data packet (a C sketch follows the list):
 
-  Each extent in the locking LRU can have on of these states:
-    requested
-    locked-by-peer
-    locked-by-me
-    locked-by-me-and-requested-by-peer
+  1. Do we have a concurrent request? (i.e. Is there a request
+     to the same block in my transfer log?) If not -> write now.
+  2. Have I already got an ACK packet for the concurrent
+     request? (i.e. Is the request's RQ_DRBD_SENT bit already
+     set?) If yes -> write the data from the data packet
+     afterwards.
+  3. Do I have the "discard-concurrent-write-flag"?
+     If yes -> discard the data packet and send a discard notify.
+     If no -> write the data from the data packet afterwards.
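+
+  ( A compact C sketch of steps 1-3; tl_find_request() and the
+    return codes are placeholders, only RQ_DRBD_SENT and the
+    discard flag come from the text above:
+
+      enum { WRITE_NOW, WRITE_AFTERWARDS, DISCARD };
+
+      static int handle_data_packet(struct drbd_dev *mdev,
+                                    sector_t sector)
+      {
+              struct drbd_request *req;
+
+              /* 1. concurrent request to the same block? */
+              req = tl_find_request(mdev, sector);
+              if (!req)
+                      return WRITE_NOW;
+
+              /* 2. our write already acknowledged by the peer? */
+              if (req->state & RQ_DRBD_SENT)
+                      return WRITE_AFTERWARDS;
+
+              /* 3. tie break via discard-concurrent-write-flag */
+              if (test_bit(DISCARD_CONCURRENT, &mdev->flags))
+                      return DISCARD; /* and send discard notify */
+              return WRITE_AFTERWARDS;
+      }
+  )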
 
-  We allow application writes only to extents which are in
-  locked-by-me* state. 
+  BTW, each time we have a concurrent write access, we print
+  a warning to the syslog, since this indicates that the layer
+  above us is broken!
 
-  New Packets:
-    LockExtent
-    LockExtentAck
+  [ see also GFS-mode-arbitration.pdf for illustration. ]
 
-  Configuration directives: dl-extents , dl-extent-size
-
-  TODO: Need to verify with GFS that this makes sense.
-
 10 Change Sync-groups to sync-after
   
   Sync groups turned out to be hard to configure and more


