[DRBD-cvs] r1550 - trunk
svn at svn.drbd.org
Tue Sep 21 17:37:13 CEST 2004
Author: phil
Date: 2004-09-21 17:37:10 +0200 (Tue, 21 Sep 2004)
New Revision: 1550
Added:
trunk/ROADMAP
Log:
What we want to do...
Added: trunk/ROADMAP
===================================================================
--- trunk/ROADMAP 2004-09-21 11:05:25 UTC (rev 1549)
+++ trunk/ROADMAP 2004-09-21 15:37:10 UTC (rev 1550)
@@ -0,0 +1,200 @@
+DRBD 0.8 Roadmap
+----------------
+
+1 Drop support for linux-2.4.x.
+  Do all size calculations on the basis of sectors (512 bytes), as
+  is common in Linux 2.6.x.
+  (Currently they are done on a 1 KiB basis, for 2.4.x compatibility.)
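+
+  A minimal userspace sketch (not DRBD code) of the unit change,
+  assuming the usual convention of 512-byte sectors:
+
+    /* sizes kept in 512-byte sectors instead of the old 1k units */
+    #include <stdio.h>
+
+    typedef unsigned long long sector_t;  /* as in linux-2.6.x */
+
+    int main(void)
+    {
+            unsigned long long bytes = 8ULL << 30;   /* 8 GiB device  */
+            sector_t sectors = bytes >> 9;           /* new: sectors  */
+            unsigned long long kunits = bytes >> 10; /* old: 1k units */
+
+            printf("%llu sectors, %llu 1k units\n",
+                   (unsigned long long)sectors, kunits);
+            return 0;
+    }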
+
+2 Drop the Drbd_Parameter_Packet.
+  Replace it with a more general and extensible mechanism.
+
+3 Authenticate the peer upon connect by using a shared secret.
+  Config file syntax: net { auth-secret "secret-word" }
+  Use a challenge-response authentication within the new
+  handshake.
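+
+  A rough userspace sketch of the idea; toy_mac() below is only a
+  placeholder, a real implementation would use a proper HMAC:
+
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    /* stand-in for a real keyed hash (e.g. an HMAC); NOT secure */
+    static unsigned long toy_mac(const char *secret,
+                                 unsigned long challenge)
+    {
+            unsigned long h = challenge;
+            while (*secret)
+                    h = h * 31 + (unsigned char)*secret++;
+            return h;
+    }
+
+    int main(void)
+    {
+            const char *secret = "secret-word"; /* from the net {} section */
+            unsigned long challenge = rand();   /* node A sends to node B  */
+            unsigned long response = toy_mac(secret, challenge); /* B -> A */
+
+            /* node A recomputes the MAC with its own copy of the
+             * secret and accepts the peer only if the answers match */
+            if (response == toy_mac(secret, challenge))
+                    printf("peer authenticated\n");
+            return 0;
+    }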
+
+4 Synchronize changes of state and cstate with a mutex, and perform
+  them only in the worker thread.
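+
+  A minimal pthreads sketch of the intended discipline (names are
+  made up for illustration, not the actual DRBD symbols):
+
+    #include <pthread.h>
+    #include <stdio.h>
+
+    static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;
+    static int state; /* stands in for state/cstate */
+
+    /* all state transitions go through here, and only the
+     * worker thread calls it */
+    static void set_state(int new_state)
+    {
+            pthread_mutex_lock(&state_mutex);
+            state = new_state;
+            pthread_mutex_unlock(&state_mutex);
+    }
+
+    static void *worker(void *arg)
+    {
+            (void)arg;
+            set_state(1); /* e.g. "Connected" */
+            return NULL;
+    }
+
+    int main(void)
+    {
+            pthread_t w;
+            pthread_create(&w, NULL, worker, NULL);
+            pthread_join(w, NULL);
+            printf("state = %d\n", state);
+            return 0;
+    }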
+
+5 Two new config options, to allow a more fine-grained definition of
+  DRBD's behaviour after a split-brain situation (a hypothetical
+  config example follows the notes below):
+
+  after-sb-2pri =
+    disconnect    No automatic resynchronisation gets performed. One
+                  node should drop its net-conf (preferably the
+                  node that would become sync-target).
+                  DEFAULT.
+    asf-older     Auto sync from is the older primary (current
+                  behaviour in this situation)
+    asf-younger   Auto sync from is the younger primary
+    asf-furthest  Auto sync from is the node that made more modifications
+    asf-NODENAME  Auto sync from is the named node
+
+
+  pri-sees-sec-with-higher-gc =
+    disconnect   (current behaviour)
+    asf-primary  Auto sync from is the current primary
+    panic        The current primary panics. The node with the
+                 higher gc should take over.
+
+
+  Notes:
+  1) The disconnect actions cause the sync-target or the secondary
+     (better: both) node to go into StandAlone state.
+  2) If two nodes in primary state try to connect, one (better: both)
+     of them goes into StandAlone state (= current behaviour).
+  3) As soon as the decision is taken, the sync-target adopts the
+     GC of the sync source.
+     [ The whole algorithm would also work if both reset their
+       GCs to <0,0,0...> after the decision, but since we also
+       use the GC to tag the bitmap, the current way is better. ]
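+
+  Hypothetical drbd.conf usage, in the syntax style of item 3
+  (option names and placement are still open):
+
+    resource r0 {
+      net {
+        after-sb-2pri asf-younger;
+        pri-sees-sec-with-higher-gc disconnect;
+      }
+    }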
+
+6 It is possible that a secondary node crashes a primary by
+  returning invalid block_ids in ACK packets. [This might be
+  caused either by faulty hardware or by a hostile modification
+  of DRBD on the secondary node.]
+
+ Proposed solution:
+
+ Have a hash table (hlist_head style), add the collision
+ member (hlist_node) to drbd_request.
+
+  Use the pointer to the drbd_request as the hash key; each
+  drbd_request is also put into this hash table. We still use the
+  pointer as block_id.
+
+  When we get an ACK packet, we look up the block_id in the hash
+  table; if we find the drbd_request there, the ACK is valid.
+  Otherwise it was a forged ACK.
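+
+  A compact userspace model of the proposal (the kernel version
+  would use hlist_head/hlist_node; names here are illustrative):
+
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    #define HASH_SLOTS 64
+
+    struct drbd_request {
+            struct drbd_request *collision; /* hash chain link */
+            /* ... payload fields omitted ... */
+    };
+
+    static struct drbd_request *slots[HASH_SLOTS];
+
+    static unsigned hash_ptr(const void *p)
+    {
+            return ((unsigned long)p >> 4) % HASH_SLOTS;
+    }
+
+    static void insert_req(struct drbd_request *req)
+    {
+            unsigned h = hash_ptr(req);
+            req->collision = slots[h];
+            slots[h] = req;
+    }
+
+    /* an ACK's block_id is valid only if it is found in the table */
+    static int lookup_req(const void *block_id)
+    {
+            struct drbd_request *r;
+            for (r = slots[hash_ptr(block_id)]; r; r = r->collision)
+                    if (r == block_id)
+                            return 1;
+            return 0;
+    }
+
+    int main(void)
+    {
+            struct drbd_request *req = malloc(sizeof(*req));
+            insert_req(req);
+            printf("real id: %d, forged id: %d\n",
+                   lookup_req(req), lookup_req((void *)0xdeadbeefUL));
+            return 0;
+    }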
+
+7 Handle split brain situations; Support IO fencing;
+ introduce the "Dead" peer state (o_state)
+
+ New commands:
+ drbdadm peer-dead r0
+ drbdadm [ considered-dead | die | fence | outdate ] r0
+  ( What do you like best? Suggestions? )
+
+ remove option value: on-disconnect=freeze_io
+
+ introduce:
+ peer-state-unknown=freeze_io
+ peer-state-unknown=continue_io
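+
+  Hypothetical drbd.conf usage (same syntax style as item 3):
+
+    net { peer-state-unknown freeze_io; }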
+
+ New meta-data flag: "Outdated"
+
+  Let us assume that we have two boxes (N1 and N2) and that these
+  two boxes are connected by two networks (net and cnet [ clients'-net ]).
+
+  Net is used by DRBD, while heartbeat uses both net and cnet.
+
+  I know that you are talking about fencing by STONITH, but DRBD is
+  not limited to that. Here is my understanding of how fencing
+  (other than STONITH) should work with DRBD 0.8:
+
+ N1 net N2
+ P/S --- S/P everything up and running.
+ P/? - - S/? network breaks ; N1 freezes IO
+ P/? - - S/? N1 fences N2:
+ In the STONITH case: turn off N2.
+ In the "smart" case:
+ N1 asks N2 to fence itself from the storage via cnet.
+ HB calls "drbdadm fence r0" on N2.
+ N2 replies to N1 that fencing is done via cnet.
+ N1 calls "drbdadm peer-dead r0".
+ P/D - - S/? N1 thaws IO
+
+  N2 got the "Outdated" flag set in its meta-data by the "fence"
+  command. I am not sure if it should be called "fence"; other ideas:
+  "considered-dead", "die", "fence", "outdate". What do you think?
+
+8 New command drbdmeta
+
+  We move read_gc.pl/write_gc.pl to the user directory and
+  merge them into one C program: drbdmeta
+  -> in the future the module never creates the meta-data
+     block. One can use drbdmeta to create, read and
+     modify the meta-data block. drbdmeta refuses to write
+     to it as long as the module is loaded (configured).
+
+  drbdsetup gets the ability to read the GC values via an
+  ioctl() call while DRBD is set up. -- drbdmeta refuses to run
+  if DRBD is configured.
+
+  drbdadm is the nice frontend. It always uses the right
+  backend (drbdmeta or drbdsetup)...
+
+ drbdadm md-set-gc 1:2:3:4:5:6 r0
+ drbdadm md-get-gc r0
+ drbdadm md-get/set-{la-size|consistent|etc...} resources....
+ drbdadm md-create r0
+
+9 Support shared disk semantics (for GFS, OCFS, etc.)
+
+  All the thoughts in this area imply that the cluster deals
+  with split-brain situations as discussed in item 7.
+
+ In order to offer a shared disk mode for GFS, we introduce a
+ new state "shared" (in addition to primary and secondary).
+
+  In a cluster of two nodes in shared state, we determine a
+  coordinator node (e.g. by selecting the node with the
+  numerically higher IP address).
+
+  read-after-write dependencies
+
+ The shared state is available to clusters using protocol C
+ and B. It is not usable with protocol A.
+
+  To support the shared state with protocol B, upon a read
+  request the node has to check whether a new version of the block
+  is in the process of being written. (== search for it on
+  active_ee and done_ee; we must make sure that it is on active_ee
+  before the RecvAck is sent. [This is already the case.] )
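+
+  Sketched in userspace C (active_ee modelled as a plain linked
+  list; the real code uses the kernel list API):
+
+    #include <stdio.h>
+
+    struct epoch_entry {
+            struct epoch_entry *next;
+            long sector;
+    };
+
+    /* models active_ee: writes that are still in flight */
+    static struct epoch_entry *active_ee;
+
+    /* a protocol-B read may be served from the local disk only if
+     * no newer version of the block is still being written */
+    static int read_must_wait(long sector)
+    {
+            struct epoch_entry *e;
+            for (e = active_ee; e; e = e->next)
+                    if (e->sector == sector)
+                            return 1;
+            return 0;
+    }
+
+    int main(void)
+    {
+            struct epoch_entry e = { NULL, 16 };
+            active_ee = &e;
+            printf("sector 16: wait=%d, sector 8: wait=%d\n",
+                   read_must_wait(16), read_must_wait(8));
+            return 0;
+    }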
+
+ global write order
+
+  As far as I understand the topic, up to now we have two options
+  to establish a global write order.
+
+ Proposed Solution 1, using the order of a coordinator node:
+
+  Writes from the coordinator node are carried out as they are
+  carried out on the primary node in conventional DRBD. ( Write
+  to disk and send to the peer simultaneously. )
+
+  Writes from the other node are sent to the coordinator first;
+  the coordinator then inserts a small "write-now" packet into
+  its stream of write packets.
+  The node commits the write to its local IO subsystem as soon
+  as it gets the "write-now" packet from the coordinator.
+
+ Note: With protocol C it does not matter which node is the
+ coordinator from the performance viewpoint.
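+
+  A toy model of solution 1 (single process; the "packets" are just
+  function calls, and all names are invented for illustration):
+
+    #include <stdio.h>
+
+    static unsigned long seq; /* global write order, owned by coordinator */
+
+    /* coordinator writes commit immediately, in the coordinator's order */
+    static void coordinator_write(long sector)
+    {
+            printf("#%lu commit sector %ld (coordinator)\n", ++seq, sector);
+    }
+
+    /* the other node forwards its write first; the coordinator's
+     * "write-now" reply tells it when to commit locally */
+    static void other_node_write(long sector)
+    {
+            unsigned long write_now = ++seq; /* coordinator slots it in */
+            printf("#%lu commit sector %ld (on write-now)\n",
+                   write_now, sector);
+    }
+
+    int main(void)
+    {
+            coordinator_write(8);
+            other_node_write(8);  /* globally ordered after the first write */
+            coordinator_write(16);
+            return 0;
+    }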
+
+  Proposed Solution 2, using ALs as distributed locks:
+
+  Only one node may mark an extent as active at a time. New
+  packets are introduced to request the locking of an extent.
+
+
+plus-branches:
+----------------------
+
+1 wait-sync-target
+
+2 Implement the checksum based resync.
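+
+  Rough sketch of the idea (toy checksum for the demo; a real
+  implementation would use a strong digest):
+
+    #include <stdio.h>
+    #include <string.h>
+
+    #define BLKSIZE 16
+
+    /* toy checksum; demo only, not collision resistant */
+    static unsigned long toy_csum(const unsigned char *b)
+    {
+            unsigned long h = 2166136261UL; /* FNV-1a style */
+            int i;
+            for (i = 0; i < BLKSIZE; i++)
+                    h = (h ^ b[i]) * 16777619UL;
+            return h;
+    }
+
+    int main(void)
+    {
+            unsigned char src[BLKSIZE] = "source block...";
+            unsigned char dst[BLKSIZE] = "source block...";
+
+            /* the sync-target sends its checksum; the sync-source
+             * only transfers the block if the checksums differ */
+            if (toy_csum(src) != toy_csum(dst))
+                    memcpy(dst, src, BLKSIZE);
+            else
+                    printf("block already in sync, transfer skipped\n");
+            return 0;
+    }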
+
+3 3-node support. Implement and test a 3-node setup (a second DRBD
+  stacked over a DRBD pair). Enhance the user-level tools to support
+  the 3-node setup.
+
+4 Change the bitmap code to work with unmapped highmem pages instead
+  of using vmalloc()ed memory. This allows users of 32-bit platforms
+  to use DRBD on big devices (in the ~3 TB range).
+
+5 Support for variable-sized meta data (esp. the bitmap), i.e. support
+  for more than 4 TB of storage.
+
+6 Support passing LockFS calls through / make taking snapshots possible (?)