[DRBD-cvs] r1550 - trunk
svn at svn.drbd.org
Tue Sep 21 17:37:13 CEST 2004
Author: phil
Date: 2004-09-21 17:37:10 +0200 (Tue, 21 Sep 2004)
New Revision: 1550
Added:
trunk/ROADMAP
Log:
What we want to do...
Added: trunk/ROADMAP
===================================================================
--- trunk/ROADMAP 2004-09-21 11:05:25 UTC (rev 1549)
+++ trunk/ROADMAP 2004-09-21 15:37:10 UTC (rev 1550)
@@ -0,0 +1,200 @@
+DRBD 0.8 Roadmap
+----------------
+
+1 Drop support for linux-2.4.x.
+  Do all size calculations on the basis of sectors (512 bytes), as
+  is common in Linux 2.6.x.
+  (Currently they are done on a 1 KiB basis, for 2.4.x compatibility.)
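+
+  A minimal userspace sketch (not DRBD code) of the unit change,
+  assuming the usual convention of 512-byte sectors:
+
+    /* sizes kept in 512-byte sectors instead of the old 1k units */
+    #include <stdio.h>
+
+    typedef unsigned long long sector_t;  /* as in linux-2.6.x */
+
+    int main(void)
+    {
+            unsigned long long bytes = 8ULL << 30;   /* 8 GiB device  */
+            sector_t sectors = bytes >> 9;           /* new: sectors  */
+            unsigned long long kunits = bytes >> 10; /* old: 1k units */
+
+            printf("%llu sectors, %llu 1k units\n",
+                   (unsigned long long)sectors, kunits);
+            return 0;
+    }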
+
+2 Drop the Drbd_Parameter_Packet.
+  Replace it with a more general and extensible mechanism.
+
+3 Authenticate the peer upon connect by using a shared secret.
+  Config file syntax: net { auth-secret "secret-word" }
+  Use a challenge-response authentication within the new
+  handshake.
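+
+  A rough userspace sketch of the idea; toy_mac() below is only a
+  placeholder, a real implementation would use a proper HMAC:
+
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    /* stand-in for a real keyed hash (e.g. an HMAC); NOT secure */
+    static unsigned long toy_mac(const char *secret,
+                                 unsigned long challenge)
+    {
+            unsigned long h = challenge;
+            while (*secret)
+                    h = h * 31 + (unsigned char)*secret++;
+            return h;
+    }
+
+    int main(void)
+    {
+            const char *secret = "secret-word"; /* from the net {} section */
+            unsigned long challenge = rand();   /* node A sends to node B  */
+            unsigned long response = toy_mac(secret, challenge); /* B -> A */
+
+            /* node A recomputes the MAC with its own copy of the
+             * secret and accepts the peer only if the answers match */
+            if (response == toy_mac(secret, challenge))
+                    printf("peer authenticated\n");
+            return 0;
+    }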
+
+4 Synchronize changes of state and cstate with a mutex, and perform
+  them only in the worker thread.
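+
+  A minimal pthreads sketch of the intended discipline (names are
+  made up for illustration, not the actual DRBD symbols):
+
+    #include <pthread.h>
+    #include <stdio.h>
+
+    static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;
+    static int state; /* stands in for state/cstate */
+
+    /* all state transitions go through here, and only the
+     * worker thread calls it */
+    static void set_state(int new_state)
+    {
+            pthread_mutex_lock(&state_mutex);
+            state = new_state;
+            pthread_mutex_unlock(&state_mutex);
+    }
+
+    static void *worker(void *arg)
+    {
+            (void)arg;
+            set_state(1); /* e.g. "Connected" */
+            return NULL;
+    }
+
+    int main(void)
+    {
+            pthread_t w;
+            pthread_create(&w, NULL, worker, NULL);
+            pthread_join(w, NULL);
+            printf("state = %d\n", state);
+            return 0;
+    }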
+
+5 Two new config options, to allow a more fine-grained definition of
+  DRBD's behaviour after a split-brain situation (a hypothetical
+  config example follows the notes below):
+
+  after-sb-2pri =
+    disconnect    No automatic resynchronisation gets performed. One
+                  node should drop its net-conf (preferably the
+                  node that would become sync-target).
+                  DEFAULT.
+    asf-older     Auto sync from is the older primary (current
+                  behaviour in this situation)
+    asf-younger   Auto sync from is the younger primary
+    asf-furthest  Auto sync from is the node that made more modifications
+    asf-NODENAME  Auto sync from is the named node
+
+
+  pri-sees-sec-with-higher-gc =
+    disconnect   (current behaviour)
+    asf-primary  Auto sync from is the current primary
+    panic        The current primary panics. The node with the
+                 higher gc should take over.
+
+
+  Notes:
+  1) The disconnect actions cause the sync-target or the secondary
+     (better: both) node to go into StandAlone state.
+  2) If two nodes in primary state try to connect, one (better: both)
+     of them goes into StandAlone state (= current behaviour).
+  3) As soon as the decision is taken, the sync-target adopts the
+     GC of the sync source.
+     [ The whole algorithm would also work if both reset their
+       GCs to <0,0,0...> after the decision, but since we also
+       use the GC to tag the bitmap, the current way is better. ]
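+
+  Hypothetical drbd.conf usage, in the syntax style of item 3
+  (option names and placement are still open):
+
+    resource r0 {
+      net {
+        after-sb-2pri asf-younger;
+        pri-sees-sec-with-higher-gc disconnect;
+      }
+    }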
+
+6 It is possible that a secondary node crashes a primary by
+  returning invalid block_ids in ACK packets. [This might be
+  caused either by faulty hardware or by a hostile modification
+  of DRBD on the secondary node.]
+
+ Proposed solution:
+
+ Have a hash table (hlist_head style), add the collision
+ member (hlist_node) to drbd_request.
+
+  Use the pointer to the drbd_request as the hash key; each
+  drbd_request is also put into this hash table. We still use the
+  pointer as block_id.
+
+  When we get an ACK packet, we look up the block_id in the hash
+  table; if we find the drbd_request there, the ACK is valid.
+  Otherwise it was a forged ACK.
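+
+  A compact userspace model of the proposal (the kernel version
+  would use hlist_head/hlist_node; names here are illustrative):
+
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    #define HASH_SLOTS 64
+
+    struct drbd_request {
+            struct drbd_request *collision; /* hash chain link */
+            /* ... payload fields omitted ... */
+    };
+
+    static struct drbd_request *slots[HASH_SLOTS];
+
+    static unsigned hash_ptr(const void *p)
+    {
+            return ((unsigned long)p >> 4) % HASH_SLOTS;
+    }
+
+    static void insert_req(struct drbd_request *req)
+    {
+            unsigned h = hash_ptr(req);
+            req->collision = slots[h];
+            slots[h] = req;
+    }
+
+    /* an ACK's block_id is valid only if it is found in the table */
+    static int lookup_req(const void *block_id)
+    {
+            struct drbd_request *r;
+            for (r = slots[hash_ptr(block_id)]; r; r = r->collision)
+                    if (r == block_id)
+                            return 1;
+            return 0;
+    }
+
+    int main(void)
+    {
+            struct drbd_request *req = malloc(sizeof(*req));
+            insert_req(req);
+            printf("real id: %d, forged id: %d\n",
+                   lookup_req(req), lookup_req((void *)0xdeadbeefUL));
+            return 0;
+    }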
+
+7 Handle split brain situations; Support IO fencing;
+ introduce the "Dead" peer state (o_state)
+
+ New commands:
+ drbdadm peer-dead r0
+ drbdadm [ considered-dead | die | fence | outdate ] r0
+  ( What do you like best? Suggestions? )
+
+ remove option value: on-disconnect=freeze_io
+
+ introduce:
+ peer-state-unknown=freeze_io
+ peer-state-unknown=continue_io
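+
+  Hypothetical drbd.conf usage (same syntax style as item 3):
+
+    net { peer-state-unknown freeze_io; }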
+
+ New meta-data flag: "Outdated"
+
+  Let us assume that we have two boxes (N1 and N2) and that these
+  two boxes are connected by two networks (net and cnet [ clients'-net ]).
+
+  Net is used by DRBD, while heartbeat uses both net and cnet.
+
+  I know that you are talking about fencing by STONITH, but DRBD is
+  not limited to that. Here is my understanding of how fencing
+  (other than STONITH) should work with DRBD 0.8:
+
+ N1 net N2
+ P/S --- S/P everything up and running.
+ P/? - - S/? network breaks ; N1 freezes IO
+ P/? - - S/? N1 fences N2:
+ In the STONITH case: turn off N2.
+ In the "smart" case:
+ N1 asks N2 to fence itself from the storage via cnet.
+ HB calls "drbdadm fence r0" on N2.
+ N2 replies to N1 that fencing is done via cnet.
+ N1 calls "drbdadm peer-dead r0".
+ P/D - - S/? N1 thaws IO
+
+  N2 got the "Outdated" flag set in its meta-data by the "fence"
+  command. I am not sure if it should be called "fence"; other ideas:
+  "considered-dead", "die", "fence", "outdate". What do you think?
+
+8 New command drbdmeta
+
+  We move read_gc.pl/write_gc.pl to the user directory and
+  merge them into one C program: drbdmeta
+  -> in the future the module never creates the meta-data
+     block. One can use drbdmeta to create, read and
+     modify the meta-data block. drbdmeta refuses to write
+     to it as long as the module is loaded (configured).
+
+  drbdsetup gets the ability to read the GC values via an
+  ioctl() call while DRBD is set up. -- drbdmeta refuses to run
+  if DRBD is configured.
+
+  drbdadm is the nice frontend. It always uses the right
+  backend (drbdmeta or drbdsetup)...
+
+ drbdadm md-set-gc 1:2:3:4:5:6 r0
+ drbdadm md-get-gc r0
+ drbdadm md-get/set-{la-size|consistent|etc...} resources....
+ drbdadm md-create r0
+
+9 Support shared disk semantics (for GFS, OCFS, etc.)
+
+  All the thoughts in this area imply that the cluster deals
+  with split-brain situations as discussed in item 7.
+
+ In order to offer a shared disk mode for GFS, we introduce a
+ new state "shared" (in addition to primary and secondary).
+
+  In a cluster of two nodes in shared state, we determine a
+  coordinator node (e.g. by selecting the node with the
+  numerically higher IP address).
+
+  read-after-write dependencies
+
+ The shared state is available to clusters using protocol C
+ and B. It is not usable with protocol A.
+
+  To support the shared state with protocol B, upon a read
+  request the node has to check whether a new version of the block
+  is in the process of being written. (== search for it on
+  active_ee and done_ee; we must make sure that it is on active_ee
+  before the RecvAck is sent. [This is already the case.] )
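+
+  Sketched in userspace C (active_ee modelled as a plain linked
+  list; the real code uses the kernel list API):
+
+    #include <stdio.h>
+
+    struct epoch_entry {
+            struct epoch_entry *next;
+            long sector;
+    };
+
+    /* models active_ee: writes that are still in flight */
+    static struct epoch_entry *active_ee;
+
+    /* a protocol-B read may be served from the local disk only if
+     * no newer version of the block is still being written */
+    static int read_must_wait(long sector)
+    {
+            struct epoch_entry *e;
+            for (e = active_ee; e; e = e->next)
+                    if (e->sector == sector)
+                            return 1;
+            return 0;
+    }
+
+    int main(void)
+    {
+            struct epoch_entry e = { NULL, 16 };
+            active_ee = &e;
+            printf("sector 16: wait=%d, sector 8: wait=%d\n",
+                   read_must_wait(16), read_must_wait(8));
+            return 0;
+    }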
+
+ global write order
+
+  As far as I understand the topic, up to now we have two options
+  to establish a global write order.
+
+ Proposed Solution 1, using the order of a coordinator node:
+
+  Writes from the coordinator node are carried out as they are
+  carried out on the primary node in conventional DRBD. ( Write
+  to disk and send to the peer simultaneously. )
+
+  Writes from the other node are sent to the coordinator first;
+  the coordinator then inserts a small "write-now" packet into
+  its stream of write packets.
+  The node commits the write to its local IO subsystem as soon
+  as it gets the "write-now" packet from the coordinator.
+
+ Note: With protocol C it does not matter which node is the
+ coordinator from the performance viewpoint.
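+
+  A toy model of solution 1 (single process; the "packets" are just
+  function calls, and all names are invented for illustration):
+
+    #include <stdio.h>
+
+    static unsigned long seq; /* global write order, owned by coordinator */
+
+    /* coordinator writes commit immediately, in the coordinator's order */
+    static void coordinator_write(long sector)
+    {
+            printf("#%lu commit sector %ld (coordinator)\n", ++seq, sector);
+    }
+
+    /* the other node forwards its write first; the coordinator's
+     * "write-now" reply tells it when to commit locally */
+    static void other_node_write(long sector)
+    {
+            unsigned long write_now = ++seq; /* coordinator slots it in */
+            printf("#%lu commit sector %ld (on write-now)\n",
+                   write_now, sector);
+    }
+
+    int main(void)
+    {
+            coordinator_write(8);
+            other_node_write(8);  /* globally ordered after the first write */
+            coordinator_write(16);
+            return 0;
+    }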
+
+  Proposed Solution 2, using ALs as distributed locks:
+
+  Only one node may mark an extent as active at a time. New
+  packets are introduced to request the locking of an extent.
+
+
+plus-branches:
+----------------------
+
+1 wait-sync-target
+
+2 Implement the checksum based resync.
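+
+  Rough sketch of the idea (toy checksum for the demo; a real
+  implementation would use a strong digest):
+
+    #include <stdio.h>
+    #include <string.h>
+
+    #define BLKSIZE 16
+
+    /* toy checksum; demo only, not collision resistant */
+    static unsigned long toy_csum(const unsigned char *b)
+    {
+            unsigned long h = 2166136261UL; /* FNV-1a style */
+            int i;
+            for (i = 0; i < BLKSIZE; i++)
+                    h = (h ^ b[i]) * 16777619UL;
+            return h;
+    }
+
+    int main(void)
+    {
+            unsigned char src[BLKSIZE] = "source block...";
+            unsigned char dst[BLKSIZE] = "source block...";
+
+            /* the sync-target sends its checksum; the sync-source
+             * only transfers the block if the checksums differ */
+            if (toy_csum(src) != toy_csum(dst))
+                    memcpy(dst, src, BLKSIZE);
+            else
+                    printf("block already in sync, transfer skipped\n");
+            return 0;
+    }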
+
+3 3-node support. Implement and test a 3-node setup (a second DRBD
+  stacked over a DRBD pair). Enhance the user-level tools to support
+  the 3-node setup.
+
+4 Change the bitmap code to work with unmapped highmem pages instead
+  of using vmalloc()ed memory. This allows users of 32-bit platforms
+  to use DRBD on big devices (in the ~3 TB range).
+
+5 Support for variable-sized meta data (esp. the bitmap), i.e. support
+  for more than 4 TB of storage.
+
+6 Support passing LockFS calls through / make taking snapshots possible (?)