[DRBD-cvs] svn commit by phil - r3060 - in branches/drbd-8.2: . user - svnp run.

drbd-cvs at lists.linbit.com
Fri Sep 7 12:15:59 CEST 2007


Author: phil
Date: 2007-09-07 12:15:46 +0200 (Fri, 07 Sep 2007)
New Revision: 3060

Modified:
   branches/drbd-8.2/
   branches/drbd-8.2/ChangeLog
   branches/drbd-8.2/ROADMAP
   branches/drbd-8.2/user/drbdmeta.c
   branches/drbd-8.2/user/drbdmeta_unfinished_rewrite.c
Log:
svnp run.



Property changes on: branches/drbd-8.2
___________________________________________________________________
Name: propagate:at:3
   - 3030
   + 3059

Modified: branches/drbd-8.2/ChangeLog
===================================================================
--- branches/drbd-8.2/ChangeLog	2007-09-07 10:13:21 UTC (rev 3059)
+++ branches/drbd-8.2/ChangeLog	2007-09-07 10:15:46 UTC (rev 3060)
@@ -4,6 +4,22 @@
  Cumulative changes since last tarball.
  For even more detail, use "svn log" and "svn diff".
 
+8.1.0 (api:86/proto:86)
+--------
+ * NOT YET RELEASED
+ * this branch
+   * will not receive new features compared to 8.0,
+   * will receive any bugfixes done on 8.0.x,
+   * will receive any code changes necessary for upstream kernel inclusion
+     as done in git://git.drbd.org/home/git/linux-drbd.git for-linus,
+     while staying - in svn - compatible with older kernel versions.
+     Reference kernel versions we won't break compatibility with are
+     vanilla 2.6.16 and later, SLES9 and later, RHEL4 and later.
+     That should be sufficient to be "general linux-2.6" compatible.
+
+Changelog for fixes propagated from 8.0.x:
+------------------------------------------
+
 8.0.5 (api:86/proto:86)
 --------
  * Changed the default behaviour of the init script. Now the init

Modified: branches/drbd-8.2/ROADMAP
===================================================================
--- branches/drbd-8.2/ROADMAP	2007-09-07 10:13:21 UTC (rev 3059)
+++ branches/drbd-8.2/ROADMAP	2007-09-07 10:15:46 UTC (rev 3060)
@@ -1,941 +1,12 @@
-DRBD 0.8 Roadmap
-----------------
+DRBD kernel inclusion roadmap
+-----------------------------
 
-1 Drop support for linux-2.4.x.
-  Do all size calculations in units of sectors (512 bytes), as is
-  common in Linux 2.6.x.
-  (Currently they are done on a 1k base, for 2.4.x compatibility)
-  90% DONE
+1. fix coding style issues
 
-2 Drop the Drbd_Parameter_Packet.
-  Replace the Drbd_Parameter_Packet by 4 small packets:
-  Protocol, GenCnt, Sizes and State.
-  The receiving code of these small packets is sane, compared
-  to that huge receive_params() function we had before.
-  40% DONE
+2. fix remaining macro issues
 
-3 Authenticate the peer upon connect by using a shared secret.
-  Configuration file syntax:  net { cram-hmac-alg "sha1";
-  shared-secret "secret-word"; }
-  Using a challenge-response authentication.
-  99% DONE
+3. fix handling of kernel threads
 
-4 Consolidate state changes into a central function, that makes
-  sure that the new state is valid. Replace set_cstate() with
-  a force_state() and a request_state() function. Make all
-  state changes atomic, and consolidate the many different
-  cstate-error states into a single "NetworkFailure" state.
-  50% DONE
+4. fix handling of work queues
 
-5 Three configuration options, to allow more fine grained definition
-  of DRBDs behaviour after a split-brain situation:
-
-  In case the nodes of your cluster see each other again after
-  a split brain situation in which both nodes were primary
-  at the same time, you have two diverged versions of your data.
-
-  In case both nodes are secondary you can control DRBD's
-  auto recovery strategy by the "after-sb-0pri" options. The
-  default is to disconnect.
-     "disconnect" ... No automatic resynchronisation, simply disconnect.
-     "discard-younger-primary"
-                      Auto sync from the node that was primary before
-                      the split brain situation happened.
-     "discard-older-primary"
-                      Auto sync from the node that became primary
-                      as second during the split brain situation.
-                      If discard-younger-primary and discard-older-primary
-                      cannot reach a decision, they fall back to
-                      discard-least-changes.
-     "discard-zero-changes"
-                      Auto sync from the node that modified
-                      blocks during the split brain situation, but only
-		      if the target not did not touched a single block.
-                      If both nodes touched their data, this policy
-		      falls back to disconnect.
-     "discard-least-changes"
-                      Auto sync from the node that touched more
-                      blocks during the split brain situation.
-     "discard-node-NODENAME"
-                      Auto sync _to_ the named node.
-
-  If one of the nodes is already primary, then the auto-recovery
-  strategy is controlled by the "after-sb-1pri" options.
-     "disconnect" ... always disconnect
-     "consensus"  ... discard the version of the secondary if the outcome
-                      of the "after-sb-0pri" algorithm would also destroy
-                      the current secondary's data. Otherwise disconnect.
-     "violently-as0p" Always take the decission of the "after-sb-0pri"
-                      algorithm. Even if that causes case an erratic change
-		      of the primarie's view of the data.
-	              This is only usefull if you use an 1node FS (i.e.
-		      not OCFS2 or GFS) with the allow-two-primaries
-		      flag, _AND_ you really know what you are doing.
-		      This is DANGEROUS and MAY CRASH YOUR MACHINE if you
-		      have a FS mounted on the primary node.
-     "discard-secondary"
-                      discard the version of the secondary.
-     "call-pri-lost-after-sb"
-                      Always honour the outcome of the "after-sb-0pri"
-                      algorithm. In case it decides that the current
-                      secondary has the right data, it tries to make
-                      the current primary secondary; if that fails,
-                      it calls the "pri-lost-after-sb" helper program
-                      on the current primary. That helper program is
-                      expected to halt the machine.
-
-  In case both nodes are primary you control DRBD's strategy by
-  the "after-sb-2pri" option.
-     "disconnect" ... Go to StandAlone mode on both sides.
-     "violently-as0p" Always take the decission of the "after-sb-0pri"
-                      algorithm. Even if that causes case an erratic change
-		      of the primarie's view of the data.
-	              This is only usefull if you use an 1node FS (i.e.
-		      not OCFS2 or GFS) with the allow-two-primaries
-		      flag, _AND_ you really know what you are doing.
-		      This is DANGEROUS and MAY CRASH YOUR MACHINE if you
-		      have a FS mounted on the primary node.
-     "call-pri-lost-after-sb"
-	              Honor the outcome of the "after-sb-0pri" algorithm
-                      and calls the "pri-lost-after-sb" program on the
-		      other node. That helper program is expected to
-                      halt the machine.
-
-  Defaults:
-  after-sb-0pri = disconnect;
-  after-sb-1pri = disconnect;
-  after-sb-2pri = disconnect;
-
-  DRBD-07 was:
-  after-sb-0pri = discard-younger-primary;
-  after-sb-1pri = consensus;
-  after-sb-2pri = disconnect;
-
-  NB: To allow the user to resolve such situations manually,
-      the "drbdadm connect" command (this is the "drbdsetup net"
-      command) gets a short-lived flag called "--discard-my-data".
-      (A configuration sketch for these options follows.)
-  99% DONE
-
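-  A minimal configuration sketch for these options (the resource
-  skeleton is abbreviated, and it assumes the options live in the
-  net section like the other net options shown in this file),
-  recreating the DRBD-07 defaults listed above:
-
-  resource r0 {
-    net {
-      after-sb-0pri discard-younger-primary;
-      after-sb-1pri consensus;
-      after-sb-2pri disconnect;
-    }
-  }
-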
-6 It is possible that a secondary node crashes a primary by
-  returning invalid block_ids in ACK packets. [This might be
-  either caused by faulty hardware, or by a hostile modification
-  of DRBD on the secondary node]
-
-  Proposed solution:
-
-  Have a hash table (hlist_head style), add the collision
-  member (hlist_node) to drbd_request.
-
-  Use the sector number of the drbd_request as key to the hash, each
-  drbd_request is also put into this hash table. We still use the
-  pointer as block_id.
-
-  When we get an ACK packet, we look up the hash table with the
-  block_id, and may find the drbd_request there. Otherwise it
-  was a forged ACK. (A sketch of this lookup follows below.)
-
-  Note: The actual key to the hash should be (sector & ~0x7).
-        See item 9 for more details.
-  99% DONE
-
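-  A minimal C sketch of that lookup (TL_HASH_SIZE, tl_hash and the
-  "collision" member are names made up for illustration; the in-tree
-  code may differ):
-
-	/* key on the 4k block, see item 9 */
-	#define TL_HASH_SLOT(s) (((unsigned long)(s) >> 3) % TL_HASH_SIZE)
-
-	static struct drbd_request *req_from_block_id(struct drbd_conf *mdev,
-						      u64 block_id,
-						      sector_t sector)
-	{
-		struct hlist_head *slot = mdev->tl_hash + TL_HASH_SLOT(sector);
-		struct hlist_node *n;
-		struct drbd_request *req;
-
-		hlist_for_each_entry(req, n, slot, collision)
-			if ((u64)(unsigned long)req == block_id &&
-			    req->sector == sector)
-				return req;
-		return NULL;	/* no match: treat the ACK as forged */
-	}
-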
-7 Handle split brain situations; Support IO fencing;
-
-  New commands:
-    drbdadm outdate r0
-
-    When the device is configured this works via an ioctl() call.
-    In the other case it modifies the meta data directly by
-    calling drbdmeta.
-
-  remove option: on-disconnect
-
-  New meta-data flag: "Outdated"
-
-  introduce:
-  disk {
-    fencing [ dont-care | resource-only | resource-and-stonith ];
-  }
-
-  handlers {
-    outdate-peer "some script";
-  }
-
-  If the disk state of the peer is unknown, drbd calls this
-  handler (yes, a call to userspace from kernel space). The handler's
-  return codes are (a mapping sketch follows at the end of this item):
-
-  3 -> peer is inconsistent
-  4 -> peer is outdated (this handler outdated it) [ resource fencing ]
-  5 -> peer was down / unreachable
-  6 -> peer is primary
-  7 -> peer got stonithed [ node fencing ]
-
-  Let us assume that we have two boxes (N1 and N2) and that these
-  two boxes are connected by two networks (net and cnet [ clients'-net ]).
-
-  Net is used by DRBD, while heartbeat uses both, net and cnet
-
-  I know that you are talking about fencing by STONITH, but DRBD is
-  not limited to that. Here is my understanding of how resource fencing
-  should work with DRBD v8:
-
-   N1  net   N2
-   P/S ---  S/P     everything up and running.
-   P/? - -  S/?     network breaks ; N1 freezes IO
-   P/? - -  S/?     N1 fences N2:
-                    In the STONITH case: turn off N2.
-                    In the resource fencing case:
-                    N1 asks N2 to fence itself from the storage via cnet.
-                    HB calls "drbdadm outdate r0" on N2.
-                    N2 replies to N1 that fencing is done via cnet.
-                    The outdate-peer script on N1 returns success to DRBD.
-   P/D - -  S/?     N1 thaws IO
-
-  N2 got the "Outdated" flag set in its meta-data by the outdate
-  command.
-
-  Setting "fencing" to "resource-only" enables this behaviour. In the
-  resource-only case the outdate-peer handler should have a return
-  value of 3, 4, 5 or 6, but should not return 7.
-
-  In case "fencing" is set to "resource-and-stonith", all IO operations
-  get immediately frozen (even all currently outstanding IO operations
-  will not finish) upon loss of connection.
-
-  Then the "outdate-peer" handler is started. In this configuration
-  the outdate peer handler might return any of the documented return
-  values.
-
-  When the outdate-peer handler returns IO is resumed.
-
-  Notes:
-  * Why do we need to freeze IO in the "resource-and-stonith" case:
-      Stonith protects you when all communication paths fail. In
-      that case both (isolated) nodes try to stonith each other.
-      If the current primary continued to allow IO it could
-      accept transactions, but could get stonithed by the
-      currently secondary node.
-      -> Therefore others could see committed transactions that
-         would be gone after the successful stonith operation.
-
-  * The outdate-peer handler also gets called if an unconnected
-    secondary wants to become primary.
-    In other words, it may only become primary when it knows that
-    the peer is outdated/inconsistent.
-
-  * We need to store the fact that the peer is outdated/inconsistent
-    in the meta-data, to allow a stand-alone primary to be rebooted.
-
-  * The outdate-peer program gets two environment variables:
-    DRBD_RESOURCE the name of the DRBD-resource and DRBD_PEER
-    the host name of the peer.
-
-  99% DONE
-
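-  A sketch of how the kernel side could map the handler's exit code
-  to a new peer-disk state (helper and state names are made up for
-  illustration):
-
-	switch (call_outdate_peer_handler(mdev)) {
-	case 3: nps = Inconsistent; break;
-	case 4: nps = Outdated; break;	/* resource fencing worked */
-	case 5: nps = DUnknown; break;	/* peer was down / unreachable */
-	case 6: nps = DUnknown; break;	/* peer is primary: do not promote */
-	case 7: nps = Outdated; break;	/* peer got stonithed */
-	default: nps = DUnknown; break;	/* unexpected exit code */
-	}
-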
-8 New command drbdmeta
-
-  We move the read_gc.pl/write_gc.pl to the user directory.
-  Merge them into one C program: drbdmeta
-   -> in the future the module never creates the meta data
-      block. One can use drbdmeta to create, read and
-      modify the drbdmeta block. drbdmeta refuses to write
-      to it as long as the module is loaded (configured).
-
-  drbdsetup gets the ability to read the gc values while DRBD
-  is set up via an ioctl() call. -- drbdmeta refuses to run
-  if DRBD is configured.
-
-  drbdadm is the nice front end. It always uses the right
-  back end (drbdmeta or drbdsetup)...
-
-  drbdadm set-gi 1:2:3:4:5:6 r0
-  drbdadm get-gi r0
-  drbdadm md-create r0
-
-  md-create would ask nasty questions about whether you are really
-  sure and so on, and do some plausibility checks first.
-  md-set would be undocumented and for wizards only.
-  80% DONE
-
-9 Support shared disk semantics  ( for GFS, OCFS etc... )
-
-    All the thoughts in this area imply that the cluster deals
-    with split brain situations as discussed in item 6.
-
-  In order to offer a shared disk mode for GFS, we allow both
-  nodes to become primary. (This needs to be enabled with the
-  config statement net { allow-two-primaries; } )
-
- Read after write dependencies
-
-  The shared mode is available to clusters using protocol C.
-  It is not usable with protocol A or B.
-
- Global write order
-
-  [ Description of GFS-mode-arbitration2.pdf ]
-
-  1. Basic mirroring with protocol C.
-    The file system on N2 issues a write request towards DRBD,
-    which is written to the local disk and sent to N1. Then
-    the data block is written to the local disk on N1 and an
-    acknowledgement packet is sent back. As soon as both the
-    write to the local disk and the ACK from N1 reach N2,
-    DRBD signals the completion of IO to the file system.
-
-    The major pitfall is the handling of concurrent writes to the
-    same block. (Concurrent writes to the same blocks should not
-    happen, but we have to assume that it is possible that the
-    synchronisation methods of our upper layer [i.e. openGFS]
-    may fail.)
-
-    There are many cases in which such concurrent writes would
-    lead to different data on our two copies of the block.
-
-  *** FIXME ***
-  description of algorithm here is out of date,
-  we handle things slightly differently now in the code.
-
-  2. Concurrent writes, network latency is lower than disk latency
-    As we can see on the left side of figure two, this could lead
-    to N1 having the blue version (=data from FS on N2) while N2
-    ends up with the green version (=data from FS on N1).
-    The solution is to flag one node (in the example N2 has the
-    discard-concurrent-writes-flag).
-    As we can see on the right side, now both nodes end up with
-    the blue data.
-
-  3. Concurrent writes, high latency for data packets.
-    The problem now is that N2 cannot detect that this was
-    a concurrent write, since it got the ACK before the conflicting
-    data packet comes in.
-    This can happen since in DRBD, data packets and ACK packets are
-    transmitted via two independent TCP connections; therefore an
-    ACK packet can overtake a data packet.
-    The solution is to send a discard info packet along with the ACK,
-    which identifies the data packet by its sequence number.
-    N2 will keep this discard info as long as it has not seen higher
-    sequence numbers.
-    With this, both nodes will end up with the blue data.
-
-  4. Concurrent writes, high latency for data packets.
-    This is the inverse case to case 3 and already handled by the means
-    introduced with item 1.
-
-  5. New write while processing a write from the peer.
-    Without further measures this would lead to an inconsistency in
-    our mirror as the figure on the left side shows.
-    If we currently write a conflicting block from the peer, we simply
-    discard the write request from our FS and signal IO completion
-    immediately.
-
-  6. High disk latency on N2.
-    By IO reordering in the layers below us this could lead to
-    having the blue data on N2 and the green data on N1.
-    The solution to this case is to delay the peer's write to the
-    local disk on N2 until our own local write is done. This is
-    different from case two since we already got the write ACK to
-    the conflicting block.
-
-  7. A data packet overtakes an ACK packet on the network.
-    Although this case is quite unlikely, we have to take it into
-    account. From N2's point of view this looks a lot like case 4,
-    but N2 should not discard the data packet now!
-
- Proposed solution
-
-  We arbitrarily select one node (e.g. the node that did the first
-  accept() in the drbd_connect() function) and mark it with the
-  discard-concurrent-writes-flag.
-
-  Each data packet and each ACK packet gets a sequence
-  number, which is increased with every packet sent.
-  (This is a common space of sequence numbers)
-
-  The algorithm which is performed upon the reception of a
-  data packet [drbd_receiver].
-
-  *  If the sequence number of the data packet is higher than
-     last_seq+1 sleep until last_seq+1 == seq_num(data packet)
-     [needed to satisfy example case 7]
-
-  1. If the packet's sequence number is on the discard list,
-     simply drop it.
-     [ ex.c. 3]
-  2. Do we have a concurrent request? (i.e. Do I have a request
-     to the same block in my transfer log.) If not -> write now.
-     [ default ]
-  3. Have I already got an ACK packet for the concurrent
-     request ? (Has the request the RQ_DRBD_SENT bit already set)
-     If yes -> write the data from the data packet afterwards.
-     [ ex.c. 6]
-  4. Do I have the "discard-concurrent-write-flag" ?
-     If yes -> discard the data packet.
-     If no -> Write data from the data packet afterwards and set
-              the RQ_DRBD_SENT bit in the request object (since
-              we will not get an ACK from our peer). Mark the
-              ee to prepend the ACK packet with a discard info
-              packet.
-     [ ex.c. *]
-
-  The algorithm which is performed upon the reception of an
-  ACK packet [drbd_asender]
-
-  * If we get an ACK, store the sequence number in last_seq.
-
-  The algorithm which is performed upon the reception of a
-  discard info packet [drbd_asender]
-
-  * If the current last_seq is lower than the sequence number of the
-    packet that should be discarded, store it on the to-discard list.
-    (A receiver-side code fragment follows at the end of this item.)
-
-  BTW, each time we have a concurrent write access, we print
-  a warning to the syslog, since this indicates that the layer
-  above us is broken!
-
-  Note: In Item 6 we created a hash table over all requests in the
-        transfer log, keyed with (sector & ~0x7). This allows us
-        to find IO operations starting in the same 4k block of
-        data quickly. -> With two lookups in the hash table we can
-        find any concurrent access.
-  99% DONE
-
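-  A C fragment of what the receiver side of this could look like
-  (last_seq, seq_wait and the discard-list helpers are names made
-  up for illustration):
-
-	/* on arrival of a data packet carrying sequence number "seq" */
-	wait_event(mdev->seq_wait,
-		   (u32)atomic_read(&mdev->last_seq) + 1 >= seq); /* ex.c. 7 */
-
-	if (seq_is_on_discard_list(mdev, seq)) {	/* ex.c. 3 */
-		drop_data_packet(e);
-		return;
-	}
-	/* otherwise continue with steps 2..4 of the algorithm above */
-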
-10 Change Sync-groups to sync-after
-
-  Sync groups turned out to be hard to configure in more
-  complex setups, hard to implement right, and last but not least
-  not flexible enough to cover all real world scenarios.
-
-  E.g. Two physical disks should be mirrored with DRBD. On one
-       of the disks there is only a single partition, while the
-       other one is divided into many (e.g. 4 smaller) partitions.
-       One would want to sync the big one in parallel to the
-       4 small ones, while the resync process of the 4 small
-       ones needs to be serialized (see the sketch below).
-       -> With the current sync groups you can not express
-          this requirement.
-
-  Remove config options   syncer { group <number>; }
-  Introduce config options   syncer { after <resource>; }
-  99% DONE
-      Finished the implementation. Tested.
-
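-  A configuration sketch for the example above (resource names are
-  made up): the big partition resyncs in parallel, while the four
-  small ones serialize among themselves:
-
-  resource big    { syncer { } }
-  resource small1 { syncer { } }
-  resource small2 { syncer { after small1; } }
-  resource small3 { syncer { after small2; } }
-  resource small4 { syncer { after small3; } }
-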
-11 Take into account that the two systems could have different
-  PAGE_SIZE.
-
-  At least we should negotiate the PAGE_SIZE used by the peers,
-  and use it. In case the PAGE_SIZE is not the same, inform
-  the user about the fact.
-
-  Probably a general high performance implementation for this
-  issue is not necessary, since clusters of machines with
-  different PAGE_SIZE are of academic interest only.
-  100% DONE by item 15
-
-12 Introduce a "common" section in the config file. Option
-  section (like handlers, startup, disk, net and syncer)
-  are inherited from the common section, if they are not
-  defined in a resource section.
-  99% DONE
-
-13 Introduce a UUID (universally unique identifier) in the
-  meta data. One purpose is to tag the bitmap with this UUID.
-  If the peer's UUID is different to what we expect we know that
-  we have to do a full sync....
-  99% DONE
-  -> Will go out again and be replaced by UUIDs for data
-     generations. See item 16.
-
-14 Sanitize ioctls to include a standard device information struct
-  at the beginning, including the expected API version.
-  Consider using DRBD ioctls with some char device similar to
-  /dev/mapper/control
-
-  The new interface is now based on netlink (actually connector).
-  It is based on the concept of tag lists. The idea is that on the
-  interface we pass lists (actually arrays) of tags, where each
-  tag identifies the following snippet of data (see the sketch below).
-  Each tag also states if it is mandatory.
-
-  In case we have to add a new value to the interface, the
-  existing userland tools continue to work with newer kernel
-  modules and vice versa. (Only the older part of the two will
-  inform the user with a warning that there was an unknown
-  tag on the interface, and that the unknown tag got ignored.)
-  But the basic functionality stays intact!
-
-  While implementing this, we also implemented dynamic device allocation.
-
-  drbdsetup is basically call compatible to its ioctl based
-  ancestor, but has two new options:
-
-    --create-device ___ create the device in case it does not exist yet.
-    --set-defaults ____ set all options not mentioned to their default values.
-
-  Things to do:
-
-  * Locking in userspace, to prevent multiple instances of drbdsetup
-  * Think about locking in kernel space ( device_mutex? )
-
-  80% DONE
-
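-  A sketch of one tag-list entry on the wire (field names and widths
-  are made up for illustration; the real layout may differ):
-
-	struct drbd_tag {
-		u16 number;	/* which attribute the value belongs to */
-		u16 flags;	/* e.g. T_MANDATORY, T_MAY_IGNORE */
-		u16 length;	/* length of the value in bytes */
-		char value[];	/* the data snippet itself follows */
-	};
-
-  A parser walks the array and skips any unknown tag unless
-  T_MANDATORY is set; that is what keeps old tools working with new
-  modules and vice versa.
-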
-15 Accept BIOs bigger than one page, probably up to 32k (8 pages)
-  currently.
-  * Normal requests. -> DONE
-  * Make the syncer accumulate adjacent bits into bigger requests. -> DONE
-  * Make the bitmap more coarse grained. -> TODO
-  66% DONE
-
-16 Replace the current generation-counters with a data-generation-UUID
-   concept.
-  The current generation counters have various weaknesses:
-   * In a split-brained cluster the application of the same events
-     to both cluster nodes could lead to equal generation-counters
-     on both nodes, while the data is certainly not in sync.
-   * They are completely unsuitable if a 3rd node is used for
-     e.g. weekly snapshots.
-   * Graceful takeover while disconnected is not possible.
-
-  We associate each data generation with a unique UUID (=64 bit random
-  number). A new data generation is created if a primary node is
-  disconnected from its secondary and when a degraded secondary
-  becomes primary for the first time.
-
-  In the meta-data we store a few generations-UUIDs:
-   * current
-   * bitmap
-   * history[2]
-
-  As well as the currently known flags:
-   Consistent, WasUpToDate, LastState, ConnectedInd, WantFullSync
-
-  When the cluster is in Connected state, the bitmap gen-UUID
-  is set to 0 (since the bitmap is empty). When we create a new current
-  gen-UUID while we are disconnected, the (old) current gets backed up
-  to the bitmap gen-UUID. (This allows us to identify the base
-  of the bitmap later.)
-
-  Special UUID values:
-  JustCreated [JC] ___  4
-
-  ALGORITHMS
-
-  Upon Connect (a C sketch follows after the table):
-      self   peer   action
-  1.  C=JC   C=JC   No Sync
-  2.  C=JC   C!=JC  I am SyncTarget setting BM
-  3. C!=JC   C=JC   I am SyncSource setting BM
-  4.   C   =   C    Common power [off|failure](Examine the roles at crash time)
-  4.1  sec   sec    Common power off, no sync.
-  4.2  pri   sec    Common power failure, I am SyncSource using BM
-  4.3  sec   pri    Common power failure, I am SyncTarget using BM
-  4.4  pri   pri    Common power failure, resync in arbitrary direction.
-  5.   C   =   B    I am SyncTarget using BM
-  6.   C   = H1|H2  I am SyncTarget setting BM
-  7.   B   =   C    I am SyncSource using BM
-  8. H1|H2 =   C    I am SyncSource setting BM
-  9.   B   =   B    [ and B != 0 ] SplitBrain, try auto recover strategies.
-  10 H1|H2 = H1|H2  SplitBrain, disconnect.
-  11.               Warn about unrelated Data, disconnect.
-
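-  A C sketch of the connect-time decision (all names are made up;
-  rules 2, 3, 6, 8 and 10, the history slots and the role-bit
-  masking are omitted for brevity):
-
-	struct uuid_set { u64 cur, bm, history[2]; };
-
-	/* 0: no sync, -1: SyncTarget, 1: SyncSource,
-	 * 2: split brain, 3: unrelated data */
-	static int uuid_compare(const struct uuid_set *self,
-				const struct uuid_set *peer)
-	{
-		if (self->cur == UUID_JUST_CREATED &&
-		    peer->cur == UUID_JUST_CREATED)
-			return 0;	/* rule 1: no sync */
-		if (self->cur == peer->cur)
-			return 0;	/* rule 4: examine crash-time roles */
-		if (self->cur == peer->bm)
-			return -1;	/* rule 5 */
-		if (self->bm == peer->cur)
-			return 1;	/* rule 7 */
-		if (self->bm != 0 && self->bm == peer->bm)
-			return 2;	/* rule 9: split brain */
-		return 3;		/* rule 11: unrelated data */
-	}
-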
-  Upon Disconnect:
-   Primary:
-      Copy the current-UUID over to the bitmap-UUID, create a new
-      current-UUID.
-   Secondary:
-      Nothing to do.
-
-  Upon becoming Primary:
-   In case we are disconnected and the bitmap-UUID is empty, copy the
-   current-UUID over to the bitmap-UUID and create a new current-UUID.
-   Special-case: primary with --do-what-I-say, clearing the inconsistent
-                 flag causes a new UUID to be generated.
-
-  Upon start of resync:
-   Clear the consistent-flag on the SyncTarget. Generate a new UUID for
-   the bitmap-UUID of the SyncSource and the current-UUID of the SyncTarget.
-
-  Upon finish of resync:
-   Set the bitmap-UUID to 0. The SyncTarget adopts the current-UUID
-   of the SyncSource, and sets its consistent-flag.
-
-  When the bitmap-UUID gets cleared, move the previous value to H1.
-  In case H1 was already set, copy its previous value to H2. Etc.
-
-  For the auto recover strategies after split brain (see item 5)
-  it is necessary to embed the node's role into the UUIDs.
-  This is masked out of course when the UUIDs are compared.
-
-  * Note1: Discontinue the --human and --timeout options when
-           becoming primary.
-           NB: If they are needed, I think they can be implemented
-               as special UUID values.
-
-  99% DONE. Kernel part is implemented, userland parts are implemented,
-            --human and --timeout-expired are removed.
-            Everything seems to work so far.
-
-  Known issues: we have to define behaviour for two-primaries case,
-  and for connection loss when Primary with local disk != UpToDate.
-
-17 Something like
-
-   drbdx: WARNING disk sizes more than 10% different
-
-  would be nice at (initial) full sync.
-
-18 Connection-Teardown Packet. Currently the new state-checks
-  disallows "drbdadm disconnect res" on the primary node of a
-  connected cluster.
-  Thes Teardown Packet causes the secondary-node to outdate
-  its data and to close the connection in one go.
-  99% DONE.
-
-19 Make the updates to the bitmap transactional. Esp for resizing.
-  Make updates to the superblock transactional
-
-20 There are quite a number of parameters that must be set equal
-   (or some reciprocal) on the two nodes. We need to ensure that
-   the config is valid, from a viewpoint of the whole cluster.
-   E.g.
-   protocol				equal
-   after-sb-0pri / discard-local/remote	equal / reciprocal
-   after-sb-1pri			equal
-   after-sb-2pri			equal
-   want_lose				reciprocal
-   two_primaries			equal
-  99% DONE
-
-21 Write barriers in the kernel
-  In Linux-2.6 write barriers in the block-io layer are represented as
-  REQ_SOFTBARRIER, REQ_HARDBARRIER and REQ_NOMERGE flags on requests.
-  In the BIO layer this is BIO_RW_BARRIER, which is usually set on
-  BIO_RW (=write) requests.
-
-  The REQ_HARDBARRIER bit is currently used to do a cache flush on
-  IDE devices. Actually not all IDE devices can do cache flushes; there
-  are some older models out there that can do write-caching but can
-  not perform a cache flush!
-
-  Journaling file systems should use this barrier mechanism in their journal
-  writes (actually on the commit block, this is the last write in a
-  transactional update to the journal).
-
-  As for DRBD we should probably ship the REQ_HARDBARRIER flags with
-  our wire protocol (or should they be expressed by Barrier packets?)
-
-  We will only see such REQ_HARDBARRIER flags if we state to the upper layers
-  that we are able to deal with them. We need to do this by announcing it:
-  blk_queue_ordered(q, QUEUE_ORDERED_FLUSH or QUEUE_ORDERED_TAG ) .
-  The default is QUEUE_ORDERED_NONE. This is the reason why we never see
-  the REQ_HARDBARRIER flag currently.
-
-  Another consequence of this is that IDE devices that do _not_ support
-  cache flushes and have write cache enabled are inherently buggy to use with
-  a journaled file system.
-
-  SCSI's Tagged queuing (seems to be present in SATA as well)
-    [excerpt from http://www.scsimechanic.com/scsi/SCSI2-07.html]
-
-    Tagged queuing allows a target to accept multiple I/O processes from
-    the same or different initiators until the logical unit's command queue
-    is full.
-
-    If only SIMPLE QUEUE TAG messages are used, the target may execute the
-    commands in any order that is deemed desirable within the constraints
-    of the queue management algorithm specified in the control mode page
-    (see 8.3.3.1).
-
-    If ORDERED QUEUE TAG messages are used, the target shall execute the
-    commands in the order received with respect to other commands received
-    with ORDERED QUEUE TAG messages. All commands received with a SIMPLE
-    QUEUE TAG message prior to a command received with an ORDERED QUEUE
-    TAG message, regardless of initiator, shall be executed before that
-    command with the ORDERED QUEUE TAG message. All commands received with
-    a SIMPLE QUEUE TAG message after a command received with an ORDERED
-    QUEUE TAG message, regardless of initiator, shall be executed after
-    that command with the ORDERED QUEUE TAG message.
-
-    A command received with a HEAD OF QUEUE TAG message is placed first in
-    the queue, to be executed next. A command received with a HEAD OF
-    QUEUE TAG message shall be executed prior to any queued I/O
-    process. Consecutive commands received with HEAD OF QUEUE TAG messages
-    are executed in a last-in-first-out order.
-
-  I think in the context of SCSI the kernel usually issues write requests
-  with the SIMPLE QUEUE TAG, and requests with the REQ_HARDBARRIER
-  (i.e. bio's with the BIO_RW_BARRIER) with an ORDERED QUEUE TAG.
-
-  What QUEUE_ORDERED_ type should we expose ?
-  (A short C sketch of the combination rule follows the table.)
-
-    In order to support capable IDE devices right, we should ship the
-    BIO_RW_BARRIER bit with our data packets in case the peer's backing
-    storage is of the QUEUE_ORDERED_FLUSH type.
-
-    If both devices are of the QUEUE_ORDERED_TAG type, we should also
-    claim to be of that type, and ship the BIO_RW_BARRIER bit as well.
-
-    self   peer      DRBD
-    ---------------------
-    NONE , NONE  =>  NONE
-    NONE , FLUSH =>  NONE
-    NONE , TAG   =>  NONE
-    FLUSH, NONE  =>  NONE
-    FLUSH, FLUSH =>  FLUSH
-    FLUSH, TAG   =>  FLUSH
-    TAG,   NONE  =>  NONE
-    TAG,   FLUSH =>  FLUSH
-    TAG,   TAG   =>  TAG
-
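-    In C, the table above is simply the minimum of the two modes
-    under the ordering NONE < FLUSH < TAG (a sketch; the enum is
-    made up):
-
-	enum wo { WO_NONE, WO_FLUSH, WO_TAG };
-
-	static enum wo combined_write_ordering(enum wo self, enum wo peer)
-	{
-		return self < peer ? self : peer;
-	}
-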
-  How should we deal with our self generated barrier packets ?
-
-    In case our backing device is of the QUEUE_ORDERED_NONE class, we
-    have to stay with the current code.
-
-    In case our backing device only supports QUEUE_ORDERED_FLUSH we
-    will use the current code. That means, when we receive a write
-    barrier packet we wait until all of our pending local write
-    requests are done. (This potentially causes congestion on the TCP
-    socket...)
-
-    In case our backing device's queue properties are set to
-    QUEUE_ORDERED_TAG we offload the complete barrier logic to the
-    backing storage device:
-
-    * When we receive a barrier packet
-      - If we have no local pending requests, we send the barrier ACK
-        immediately. (= current code)
-      - If the last_barrier_write member of mdev points to an epoch_entry
-        we set bit 31 of bnum.
-      - If we have local pending requests, we set a flag that the next
-        data packet has to be written with the BIO_RW_BARRIER flag.
-        (That flag should be called BARRIER_NEEDED)
-
-    * When receiving data packets we test_and_clear BARRIER_NEEDED,
-      and set the BIO_RW_BARRIER on the write request. We also set
-      the last_barrier_write member of mdev.
-      [Normal writes clear the last_barrier_write member of mdev]
-
-    * When a write completes and it has the bnum set, send the barrier
-      ack before sending the ack for the write. In case the highest
-      bit of bnum is set as well, also send the barrier ack following
-      the write ack of the data packet.
-
-  90% DONE [ Not tested yet. ]
-
-22 Reboot notifier.
-
-23 Externally imposed SyncPause states.
-   There are two new commands: 'drbdadm pause-sync res'
-                               'drbdadm resume-sync res'
-   These may be used to suspend the resynchronisation process while
-   e.g. the backing storages' raid controller does its resynchronisation.
-
-   While implementing this, I also made sure that in a 3 node
-   setup the two peers of a connection will agree if a resynchronisation
-   is paused under all conditions you can think of, if there are more
-   than two nodes!
-
-   99% DONE
-
-24 Make it possible to hot-add disk drives == Atomic configuration changes.
-
-   99% DONE
-
-25 Add reserved fields to DRBD-meta-data, add a bytes per bit field to
-   metadata.
-
-   99% DONE
-
-26 Implement a kind of "dstate" command to make integration with
-   Heartbeat-2.0's master/slave-support possible.
-
-   99% DONE
-
-27 Remove all explicit drbd_md_write() calls, and create a mechanism
-   that always keeps the on-disk metadata up-to-date implicitly.
-   Calling drbd_md_write() explicitly is too error-prone.
-
-   99% DONE
-
-28 Implement a kind of 'call home', a single HTTP get request, that
-   gets counted in a data base. The initiator calculates a simple
-   hash over the machine and resource names. Each time a meta-data
-   set gets generated, the 'call home' is initiated. The user might
-   of course opt out of this.
-
-   99% DONE
-
-29 Give drbdadm a 'hidden-commands' command to also show
-   the hidden sub-commands in the usage.
-
-   99% DONE
-
-30 The current drbdadm_scanner is 1MB in source and as binary.
-   Use a _basic_ flex scanner, and a hand written parser for superb
-   error reporting.
-
-   99% DONE
-
-31 Resizing several GB results in ko-count timeouts, maybe since the
-   secondary node does the enlargement of the bitmap in the receiver (?)
-
-   DONE, by using the async bitmap IO code.
-
-32 drbdmeta: with internal meta-data v07 and v08 meta-data super blocks
-   are in different places. -> It is possible to have v07 AND v08 meta
-   data on one device.
-   => drbdmeta should make sure that it overwrites the other location
-      in case it creates a meta-data block.
-
-   99% DONE
-
-33 Serialize state changes like secondary -> primary and
-   Connected -> SyncSource in the cluster.
-
-      role <- primary
-      conn <- StartingSyncT (disk <- inconsistent)
-      conn <- StartingSyncS (pdsk <- inconsistent)
-      disk <- Diskless (as long as it happens as administrative command)
-      pdsk <- Outdated (= a 'disconnect' issued on a primary node)
-
-   * When a state change might sleep ( request_state() ) and it is
-     to be cluster wide atomic ( pre_state_checks() determines this!):
-	1. Acquire the cluster state change lock (bit & waitqueue) ?
-	2. We send a request_state packet.
-
-   * When a request_state packet is received
-
-	1. * If we are UNIQUE we take the cluster lock (potentially
-	     waiting for it) and try to apply the remote's request
-	     as soon as we have the lock.
-	   * When we are not UNIQUE we try to apply the state change
-	     immediately (without taking the cluster lock).
-	2. We send the ACK / NACK.
-	   ( Do we actually need an ACK/NACK ?
- 		* On the not UNIQUE side, we will fail the request as
-		  soon as the offending state request comes in.
-                * On the UNIQUE side we need the positive ACK to
-                  continue.
-                ) I guess for the sake of completeness, we should
-                  have both packets, although currently the need for
-                  the NACK packet is not obvious.
-
-   * When we receive an ACK / NACK we either successfully finish or
-     fail the request_state() call. (Error codes should be passed
-     from the peer.)
-
-   * When the connection fails ( = actually a non-cluster wide state
-     change happens while a cluster wide state change goes on), we
-     need to re-evaluate the pre state change check. In case the
-     pre state change check allows the new state we can proceed,
-     otherwise we need to fail the request.
-
-   * How to do the synchronisation from the receipt of the ACK / NACK
-     packet to the termination of the request_state() function ?
-       * wait_queue & bit.
-
-   DATA STRUCTURES (a usage sketch follows below):
-	* A CLUSTER_STATE_CHANGE bit == the cluster lock bit.
-	* A CL_ST_CHG_SUCCESS  bit set by the receiver.
-	* A CL_ST_CHG_FAIL     bit set by the receiver.
-	* A wait queue.
-
-   TODOS:
-    Evaluate if it is possible to use it for starting resync. (invalidate)
-    Evaluate it for the other cases...
-
-  90% implemented. Changing the role to primary already uses this
-       mechanism. Starting resync with invalidate and invalidate_remote
-       now also uses this method. Detaching now also uses this mechanism.
-
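-   A sketch of the requester side using the bits and wait queue
-   listed above (all names are made up for illustration):
-
-	/* result bits must have been cleared before this point */
-	set_bit(CLUSTER_ST_CHANGE, &mdev->flags);	/* the cluster lock */
-	send_state_request(mdev, mask, val);
-	wait_event(mdev->state_wait,
-		   test_bit(CL_ST_CHG_SUCCESS, &mdev->flags) ||
-		   test_bit(CL_ST_CHG_FAIL, &mdev->flags));
-	clear_bit(CLUSTER_ST_CHANGE, &mdev->flags);
-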
-34 Improve the initial hand-shake, to identify the sockets (and TCP-
-   links) by an initial message, and not only by the connection timing.
-
-   99% DONE
-
-35 Bigger AL-extents (e.g. 16MB)
-
-36 Increase the number of UUID history slots.
-
-37 In case heartbeat (or someone else) makes us primary, we need to
-   check first if the peer is alive.
-   Currently we have a problem when heartbeat's dead time is smaller
-   than DRBD's network timeout.
-
-38 Create another on-io-error handler that retries failed read
-   operations on the peer, but does not detach from the local disk.
-   It also marks that block in the bitmap as out-of-date.
-
-   Simon works on this.
-
-39 Send mirrored write requests out of the worker context.
-   99% DONE
-
-40 Do something with FLUSHBUFS ioctl.
-
-41 Fix DRBD's behaviour in case of a common power failure when
-   both nodes were in primary role.
-
-   See the Algorithm of Item 16, sections 4 to 4.4.
-
-   Further we need the resync roles conflict "rr-conflict"
-   strategy option with the following values:
-
-   The available options are:
-     "disconnect" ... No automatic resynchronisation, simply disconnect.
-     "violently" .... Sync to the primary node is allowed, violating the
-	              assumption that data on a block device is stable
-		      for one of the nodes. DANGEROUS, DO NOT USE.
-     "call-pri-lost"
-                      Call this helper program on one of the machines.
-                      This program is expected to halt or reboot the
-                      machine.
-
-   An exception of course is a primary disk-less node that gets a disk
-   attached. Such a node becomes sync target, but since it does not
-   involve a violent data change, this state transition is always allowed.
-
-   99% DONE
-
-42 Forward port the ability to resume the TL after IO was frozen,
-   in case the connection is reestablished again.
-
-43 Fix indexed meta-data.
-
-44 Callbacks to userspace should run asynchronously.
-
-Maybe:
-
-*  Switch to protocol C in case we are running without a local
-   disk and are configured to use protocol A or B.
-
-*  Dynamic misc char device instead of IOCTLs for configuration. Evaluate
-   if the configuration could be done over a netlink socket as well...
-
-*  A netlink socket to communicate events to userspace.
-   - All state changes
-   - the need to outdate the peer
-
-*  Write some heartbeat glue to do a graceful switchover in case of
-   a local IO failure. (requires the netlink socket thing)
-
-plus-branches:
-----------------------
-
-1 Make use-csums use the kernel's crypto API
-
-2 Implement online verification
-
-3 Change the bitmap code to work with unmapped highmem pages, instead
-  of using vmalloc()ed memory. This allows users of 32bit platforms
-  to use drbd on big devices (in the ~3TB range)
-
-4 3 node support. Do and test a 3 node setup (2nd DRBD stacked over
-  a DRBD pair). Enhance the user level tools to support the 3 node
-  setup.
-
-5 Have protocol version 74 available in drbd-0.8, to allow rolling
-  upgrades
-
+5. fix [or discuss away ;-)] anything else brought up on lkml

Modified: branches/drbd-8.2/user/drbdmeta.c
===================================================================
--- branches/drbd-8.2/user/drbdmeta.c	2007-09-07 10:13:21 UTC (rev 3059)
+++ branches/drbd-8.2/user/drbdmeta.c	2007-09-07 10:15:46 UTC (rev 3060)
@@ -273,9 +273,9 @@
 	struct md_cpu md;
 
 	/* _byte_ offsets of our "super block" and other data, within fd */
-	u64 md_offset;
-	u64 al_offset;
-	u64 bm_offset;
+	s64 md_offset;
+	s64 al_offset;
+	s64 bm_offset;
 	size_t md_mmaped_length;
 	size_t al_mmaped_length;
 	size_t bm_mmaped_length;
@@ -684,14 +684,14 @@
 int v07_parse(struct format *cfg, char **argv, int argc, int *ai);
 int v07_md_initialize(struct format *cfg);
 void v07_md_erase_others(struct format *cfg);
-u64 v07_md_get_byte_offset(struct format * cfg);
+s64 v07_md_get_byte_offset(struct format * cfg);
 
 int v08_md_open(struct format *cfg);
 int v08_md_cpu_to_disk(struct format *cfg);
 int v08_md_disk_to_cpu(struct format *cfg);
 int v08_md_initialize(struct format *cfg);
 void v08_md_erase_others(struct format *cfg);
-u64 v08_md_get_byte_offset(struct format * cfg);
+s64 v08_md_get_byte_offset(struct format * cfg);
 
 struct format_ops f_ops[] = {
 	[Drbd_06] = {
@@ -881,7 +881,7 @@
 }
 
 int v07_style_md_open(struct format *cfg,
-		      u64 (*md_get_byte_offset) (struct format *),
+		      s64 (*md_get_byte_offset) (struct format *),
 		      size_t size)
 {
 	struct stat sb;
@@ -905,7 +905,7 @@
 		exit(20);
 	}
 
-	if (ioctl(cfg->md_fd, BLKFLSBUF) == -1) {
+	if (ioctl(cfg->md_fd, BLKFLSBUF, NULL) == -1) {
 		PERROR("WARN: ioctl(,BLKFLSBUF,) failed");
 	}
 
@@ -938,7 +938,7 @@
 	// For the case that someone modified la_sect by hand..
 	if( (cfg->md_index == DRBD_MD_INDEX_INTERNAL ||
 	     cfg->md_index == DRBD_MD_INDEX_FLEX_INT ) &&
-	    (cfg->md.la_sect*512 > cfg->md_offset) ) {
+	    (cfg->md.la_sect*512 > (u64)cfg->md_offset) ) {
 		printf("la-size-sect was too big, fixed.\n");
 		cfg->md.la_sect = cfg->md_offset/512;
 	}
@@ -972,7 +972,7 @@
 }
 
 void md_erase_sb(struct format *cfg,
-		 u64 (*md_get_byte_offset) (struct format *))
+		 s64 (*md_get_byte_offset) (struct format *))
 {
 	/* in case these are internal meta data, we need to
 	   make sure that there is no v08 superblock at the end
@@ -980,7 +980,7 @@
 
 	unsigned char zero_sector[512];
 	struct format cfg_f;
-	u64 offset;
+	s64 offset;
 	int bw;
 
 	if(cfg->md_index == DRBD_MD_INDEX_INTERNAL ||
@@ -992,6 +992,8 @@
 		   in the front of the meta data area. */
 
 		offset = md_get_byte_offset(&cfg_f);
+		if (offset < 0)
+			return;
 		if(lseek64(cfg->md_fd, offset, SEEK_SET) == -1) {
 			PERROR("lseek64() failed");
 			exit(20);
@@ -1404,9 +1406,9 @@
  begin of v07 {{{
  ******************************************/
 
-u64 v07_md_get_byte_offset(struct format *cfg)
+s64 v07_md_get_byte_offset(struct format *cfg)
 {
-	u64 offset;
+	s64 offset;
 
 	switch(cfg->md_index) {
 	default: /* external, some index */
@@ -1509,7 +1511,7 @@
 		PERROR("fsync() failed");
 		err = -1;
 	}
-	if (ioctl(cfg->md_fd, BLKFLSBUF) == -1) {
+	if (ioctl(cfg->md_fd, BLKFLSBUF, NULL) == -1) {
 		PERROR("ioctl(,BLKFLSBUF,) failed");
 		err = -1;
 	}
@@ -1545,9 +1547,9 @@
  begin of v08 {{{
  ******************************************/
 
-u64 v08_md_get_byte_offset(struct format *cfg)
+s64 v08_md_get_byte_offset(struct format *cfg)
 {
-	u64 offset;
+	s64 offset;
 
 	switch(cfg->md_index) {
 	default: /* external, some index */

Modified: branches/drbd-8.2/user/drbdmeta_unfinished_rewrite.c
===================================================================
--- branches/drbd-8.2/user/drbdmeta_unfinished_rewrite.c	2007-09-07 10:13:21 UTC (rev 3059)
+++ branches/drbd-8.2/user/drbdmeta_unfinished_rewrite.c	2007-09-07 10:15:46 UTC (rev 3060)
@@ -2028,7 +2028,7 @@
 		DRBD_MD_INDEX_FLEX_INT, cfg->bd_size);
 
 	printf("%lld\n%lld\n%lld\n", cfg->bd_size, fixed_offset, flex_offset);
-	if (fixed_offset < (off_t)cfg->bd_size - 4096) {
+	if (0 <= fixed_offset && fixed_offset < (off_t)cfg->bd_size - 4096) {
 		/* ... v07 fixed-size internal meta data? */
 		PREAD(cfg->md_fd, on_disk_buffer, 4096, fixed_offset);
 	


