[DRBD-cvs] svn commit by phil - r3060 - in branches/drbd-8.2: . user - svnp run.
drbd-cvs at lists.linbit.com
Fri Sep 7 12:15:59 CEST 2007
Author: phil
Date: 2007-09-07 12:15:46 +0200 (Fri, 07 Sep 2007)
New Revision: 3060
Modified:
branches/drbd-8.2/
branches/drbd-8.2/ChangeLog
branches/drbd-8.2/ROADMAP
branches/drbd-8.2/user/drbdmeta.c
branches/drbd-8.2/user/drbdmeta_unfinished_rewrite.c
Log:
svnp run.
Property changes on: branches/drbd-8.2
___________________________________________________________________
Name: propagate:at:3
- 3030
+ 3059
Modified: branches/drbd-8.2/ChangeLog
===================================================================
--- branches/drbd-8.2/ChangeLog 2007-09-07 10:13:21 UTC (rev 3059)
+++ branches/drbd-8.2/ChangeLog 2007-09-07 10:15:46 UTC (rev 3060)
@@ -4,6 +4,22 @@
Cumulative changes since last tarball.
For even more detail, use "svn log" and "svn diff".
+8.1.0 (api:86/proto:86)
+--------
+ * NOT YET RELEASED
+ * this branch
+ * will not receive new features compared to 8.0,
+ * will receive any bugfixes done on 8.0.x,
+ * will receive any code changes necessary for upstream kernel inclusion
+ as done in git://git.drbd.org/home/git/linux-drbd.git for-linus,
+ while staying - in svn - compatible with older kernel versions.
+ Reference kernel versions we won't break compatibility with are
+ vanilla 2.6.16 and later, SLES9 and later, RHEL4 and later.
+ That should be sufficient to be "general linux-2.6" compatible.
+
+Changelog for fixes propagated from 8.0.x:
+------------------------------------------
+
8.0.5 (api:86/proto:86)
--------
* Changed the default behaviour of the init script. Now the init
Modified: branches/drbd-8.2/ROADMAP
===================================================================
--- branches/drbd-8.2/ROADMAP 2007-09-07 10:13:21 UTC (rev 3059)
+++ branches/drbd-8.2/ROADMAP 2007-09-07 10:15:46 UTC (rev 3060)
@@ -1,941 +1,12 @@
-DRBD 0.8 Roadmap
-----------------
+DRBD kernel inclusion roadmap
+-----------------------------
-1 Drop support for linux-2.4.x.
- Do all size calculations on the base of sectors (512 Byte) as it
- is common in Linux-2.6.x.
- (Currently they are done on a 1k base, for 2.4.x compatibility)
- 90% DONE
+1. fix coding style issues
-2 Drop the Drbd_Parameter_Packet.
- Replace the Drbd_Parameter_Packet by 4 small packets:
- Protocol, GenCnt, Sizes and State.
- The receiving code of these small packets is sane, compared
- to that huge receive_params() function we had before.
- 40% DONE
+2. fix remaining macro issues
-3 Authenticate the peer upon connect by using a shared secret.
- Configuration file syntax: net { cram-hmac-alg "sha1";
- shared-secret "secret-word"; }
- Using a challenge-response authentication.
- 99% DONE
+3. fix handling of kernel threads
-4 Consolidate state changes into a central function, that makes
- sure that the new state is valid. Replace set_cstate() with
- a force_state() and a request_state() function. Make all
- state changes atomic, and consolidate the many different
- cstate-error states into a single "NetworkFailure" state.
- 50% DONE
+4. fix handling of work queues
-5 Three configuration options, to allow more fine grained definition
- of DRBDs behaviour after a split-brain situation:
-
- In case the nodes of your cluster see each other again after
- a split brain situation in which both nodes were primary
- at the same time, you have two diverged versions of your data.
-
- In case both nodes are secondary you can control DRBD's
- auto recovery strategy by the "after-sb-0pri" options. The
- default is to disconnect.
- "disconnect" ... No automatic resynchronisation, simply disconnect.
- "discard-younger-primary"
- Auto sync from the node that was primary before
- the split brain situation happened.
- "discard-older-primary"
- Auto sync from the node that became primary
- as second during the split brain situation.
- If discard-younger-primary and discard-older-primary
- cannot reach a decision, they fall back to
- discard-least-changes.
- "discard-zero-changes"
- Auto sync from the node that modified
- blocks during the split brain situation, but only
- if the target node did not touch a single block.
- If both nodes touched their data, this policy
- falls back to disconnect.
- "discard-least-changes"
- Auto sync from the node that touched more
- blocks during the split brain situation.
- "discard-node-NODENAME"
- Auto sync _to_ the named node.
-
- If one of the nodes is already primary, then the auto-recovery
- strategy is controlled by the "after-sb-1pri" options.
- "disconnect" ... always disconnect
- "consensus" ... discard the version of the secondary if the outcome
- of the "after-sb-0pri" algorithm would also destroy
- the current secondary's data. Otherwise disconnect.
- "violently-as0p" Always take the decision of the "after-sb-0pri"
- algorithm, even if that causes an erratic change
- of the primary's view of the data.
- This is only useful if you use a 1node FS (i.e.
- not OCFS2 or GFS) with the allow-two-primaries
- flag, _AND_ you really know what you are doing.
- This is DANGEROUS and MAY CRASH YOUR MACHINE if you
- have a FS mounted on the primary node.
- "discard-secondary"
- discard the version of the secondary.
- "call-pri-lost-after-sb"
- Always honour the outcome of the "after-sb-2sc"
- algorithm. In case it decides that the current
- secondary has the right data, it tries to make
- the current primary secondary, if that fails
- it calls the "pri-lost-after-sb" helper program
- on the current primary. That helper program is
- expected to halt the machine.
-
- In case both nodes are primary you control DRBD's strategy by
- the "after-sb-2pri" option.
- "disconnect" ... Go to StandAlone mode on both sides.
- "violently-as0p" Always take the decision of the "after-sb-0pri"
- algorithm, even if that causes an erratic change
- of the primary's view of the data.
- This is only useful if you use a 1node FS (i.e.
- not OCFS2 or GFS) with the allow-two-primaries
- flag, _AND_ you really know what you are doing.
- This is DANGEROUS and MAY CRASH YOUR MACHINE if you
- have a FS mounted on the primary node.
- "call-pri-lost-after-sb"
- Honor the outcome of the "after-sb-0pri" algorithm
- and call the "pri-lost-after-sb" program on the
- other node. That helper program is expected to
- halt the machine.
-
- Defaults:
- after-sb-0pri = disconnect;
- after-sb-1pri = disconnect;
- after-sb-2pri = disconnect;
-
- DRBD-07 was:
- after-sb-0pri = discard-younger-primary;
- after-sb-1pri = consensus;
- after-sb-2pri = disconnect;
-
- NB: To allow the user to resolve from such situations manually
- the "drbdadm connect" command (this is the "drbdsetup net"
- command) gets a short-living flag called "--discard-my-data".
- 99% DONE
-
-6 It is possible that a secondary node crashes a primary by
- returning invalid block_ids in ACK packets. [This might be
- either caused by faulty hardware, or by a hostile modification
- of DRBD on the secondary node]
-
- Proposed solution:
-
- Have a hash table (hlist_head style), add the collision
- member (hlist_node) to drbd_request.
-
- Use the sector number of the drbd_request as key to the hash, each
- drbd_request is also put into this hash table. We still use the
- pointer as block_id.
-
- When we get an ACK packet, we lookup the hash table with the
- block_id, and may find the drbd_request there. Otherwise it
- was a forged ACK.
-
- Note: The actual key to the hash should be (sector & ~0x7).
- See item 9 for more details.
- 99% DONE
-
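The ACK check proposed in item 6 can be sketched in a few lines of Python. This is only an illustration of the idea, not the kernel code: the class name `RequestTable` and its methods are hypothetical, a dict of buckets stands in for the hlist_head table, and it is an assumption here that the ACK carries the sector along with the block_id.

```python
from collections import defaultdict

SECTOR_MASK = ~0x7  # 8 sectors of 512 bytes share one 4k block


class RequestTable:
    """Pending requests, bucketed by (sector & ~0x7) as in the roadmap note."""

    def __init__(self):
        # bucket key -> {block_id: sector of the pending request}
        self._buckets = defaultdict(dict)

    def add(self, block_id, sector):
        self._buckets[sector & SECTOR_MASK][block_id] = sector

    def remove(self, block_id, sector):
        self._buckets[sector & SECTOR_MASK].pop(block_id, None)

    def ack_is_valid(self, block_id, sector):
        """True only if exactly this request is pending; else the ACK is forged."""
        bucket = self._buckets.get(sector & SECTOR_MASK, {})
        return bucket.get(block_id) == sector
```

An ACK whose block_id is not found in the bucket for its sector is rejected instead of being dereferenced as a pointer.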
-7 Handle split brain situations; Support IO fencing;
-
- New commands:
- drbdadm outdate r0
-
- When the device is configured this works via an ioctl() call.
- In the other case it modifies the meta data directly by
- calling drbdmeta.
-
- remove option: on-disconnect
-
- New meta-data flag: "Outdated"
-
- introduce:
- disk {
- fencing [ dont-care | resource-only | resource-and-stonith ];
- }
-
- handlers {
- outdate-peer "some script";
- }
-
- If the disk state of the peer is unknown, drbd calls this
- handler (yes a call to userspace from kernel space). The handler's
- returncodes are:
-
- 3 -> peer is inconsistent
- 4 -> peer is outdated (this handler outdated it) [ resource fencing ]
- 5 -> peer was down / unreachable
- 6 -> peer is primary
- 7 -> peer got stonithed [ node fencing ]
-
- Let us assume that we have two boxes (N1 and N2) and that these
- two boxes are connected by two networks (net and cnet [ clients' net ]).
-
- Net is used by DRBD, while heartbeat uses both, net and cnet
-
- I know that you are talking about fencing by STONITH, but DRBD is
- not limited to that. Here comes my understanding of how resource fencing
- should work with DRBDv8:
-
- N1 net N2
- P/S --- S/P everything up and running.
- P/? - - S/? network breaks ; N1 freezes IO
- P/? - - S/? N1 fences N2:
- In the STONITH case: turn off N2.
- In the resource fencing case:
- N1 asks N2 to fence itself from the storage via cnet.
- HB calls "drbdadm outdate r0" on N2.
- N2 replies to N1 that fencing is done via cnet.
- The outdate-peer script on N1 returns success to DRBD.
- P/D - - S/? N1 thaws IO
-
- N2 got the "Outdated" flag set in its meta-data, by the outdate
- command.
-
- Setting fencing to resource-only enables this behaviour. In the
- resource-only case the outdate-peer handler should have a return
- value of 3, 4, 5 or 6, but should not return 7.
-
- In case "fencing" is set to "resource-and-stonith", all IO operations
- get immediately frozen (even all currently outstanding IO operations
- will not finish) upon loss of connection.
-
- Then the "outdate-peer" handler is started. In this configuration
- the outdate peer handler might return any of the documented return
- values.
-
- When the outdate-peer handler returns, IO is resumed.
-
- Notes:
- * Why do we need to freeze IO in the "resource-and-stonith" case:
- Stonith protects you when all communication paths fail. In
- that case both (isolated) nodes try to stonith each other.
- If the current primary would continue to allow IO it could
- accept transactions, but could get stonithed by the
- currently secondary node.
- -> Therefore others could see committed transactions that
- would be gone after the successful stonith operation.
-
- * The outdate-peer handler also gets called if an unconnected
- secondary wants to become primary.
- In other words it only may become primary when it knows that
- the peer is outdated/inconsistent.
-
- * We need to store the fact that the peer is outdated/inconsistent
- in the meta-data, to allow a standalone primary to be rebooted.
-
- * The outdate-peer program gets two environment variables:
- DRBD_RESOURCE the name of the DRBD-resource and DRBD_PEER
- the host name of the peer.
-
- 99% DONE
-
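The return-code rules of item 7 can be written down as a small lookup. The following Python sketch is an illustration only: the function name `check_handler_result` is hypothetical, and treating a forbidden code as an error is an assumption (the text only says the handler "should not return 7" with resource-only).

```python
# Return codes of the outdate-peer handler, as listed in item 7.
HANDLER_CODES = {
    3: "peer is inconsistent",
    4: "peer is outdated",
    5: "peer was down / unreachable",
    6: "peer is primary",
    7: "peer got stonithed",
}

# Codes a handler may return under each fencing policy.
ALLOWED_CODES = {
    "resource-only": {3, 4, 5, 6},
    "resource-and-stonith": {3, 4, 5, 6, 7},
}


def check_handler_result(fencing, code):
    """Return the meaning of `code`, or raise if the policy does not allow it."""
    if fencing not in ALLOWED_CODES:
        raise ValueError("no handler rules known for fencing=%r" % fencing)
    if code not in ALLOWED_CODES[fencing]:
        raise ValueError("code %r not allowed with fencing=%r" % (code, fencing))
    return HANDLER_CODES[code]
```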
-8 New command drbdmeta
-
- We move the read_gc.pl/write_gc.pl to the user directory.
- Make them to one C program: drbdmeta
- -> in the future the module never creates the meta data
- block. One can use drbdmeta to create, read and
- modify the drbdmeta block. drbdmeta refuses to write
- to it as long as the module is loaded (configured).
-
- drbdsetup gets the ability to read the gc values while DRBD
- is set up via an ioctl() call. -- drbdmeta refuses to run
- if DRBD is configured.
-
- drbdadm is the nice front end. It always uses the right
- back end (drbdmeta or drbdsetup)...
-
- drbdadm set-gi 1:2:3:4:5:6 r0
- drbdadm get-gi r0
- drbdadm md-create r0
-
- md-create would ask nasty questions about whether you are really
- sure and so on, and do some plausibility checks first.
- md-set would be undocumented and for wizards only.
- 80% DONE
-
-9 Support shared disk semantics ( for GFS, OCFS etc... )
-
- All the thoughts in this area, imply that the cluster deals
- with split brain situations as discussed in item 6.
-
- In order to offer a shared disk mode for GFS, we allow both
- nodes to become primary. (This needs to be enabled with the
- config statement net { allow-two-primaries; } )
-
- Read after write dependencies
-
- The shared mode is available to clusters using protocol C.
- It is not usable with protocol A or B.
-
- Global write order
-
- [ Description of GFS-mode-arbitration2.pdf ]
-
- 1. Basic mirroring with protocol C.
- The file system on N2 issues a write request towards DRBD,
- which is written to the local disk and sent to N1. Then
- the data block is written to the local disk there and an
- acknowledge packet is sent back. As soon as both the
- write to the local disk and the ACK from N1 reach N2,
- DRBD signals the completion of IO to the file system.
-
- The major pitfall is the handling of concurrent writes to the
- same block. (Concurrent writes to the same blocks should not
- happen, but we have to assume that it is possible that the
- synchronisation methods of our upper layer [i.e. openGFS]
- may fail.)
-
- There are many cases in which such concurrent writes would
- lead to different data on our two copies of the block.
-
- *** FIXME ***
- description of algorithm here is out of date,
- we handle things slightly differently now in the code.
-
- 2. Concurrent writes, network latency is lower than disk latency
- As we can see on the left side in figure two this could lead
- to N1 has the blue version (=data from FS on N2) while N2
- ends with having the green version (=data from FS on N1).
- The solution is to flag one node (in the example N2 has the
- discard-concurrent-writes-flag).
- As we can see on the right side, now both nodes end with
- the blue data.
-
- 3. Concurrent writes, high latency for data packets.
- The problem now is that N2 cannot detect that this was
- a concurrent write, since it got the ACK before the conflicting
- data packet comes in.
- This can happen since in DRBD, data packets and ACK packets are
- transmitted via two independent TCP connections, therefore the
- ACK packet can overtake a data packet.
- The solution is to send with the ACK packet a discard info packet,
- which identifies the data packet by its sequence number.
- N2 will keep this discard info as long as it has not seen higher
- sequence numbers by now.
- With this both nodes will end with the blue data.
-
- 4. Concurrent writes, high latency for data packets.
- This is the inverse case to case3 and already handled by the means
- introduced with item 1.
-
- 5. New write while processing a write from the peer.
- Without further measures this would lead to an inconsistency in
- our mirror as the figure on the left side shows.
- If we currently write a conflicting block from the peer, we simply
- discard the write request from our FS and signal IO completion
- immediately.
-
- 6. High disk latency on N2.
- By IO reordering in the layers below us this could lead to
- having the blue data on N2 and the green data on N1.
- The solution to this case is to delay the write to the local
- disk on N2 until the local write is done. This is different from
- case two since we already got the write ACK to the conflicting
- block.
-
- 7. A data packet overtakes an ACK packet on the network.
- Although this case is quite unlikely, we have to take it into
- account. From N2's point of view this looks a lot like case 4,
- but N2 should not delete the data packet now!
-
- Proposed solution
-
- We arbitrarily select one node (e.g. the node that did the first
- accept() in the drbd_connect() function) and mark it with the
- discard-concurrent-writes-flag.
-
- Each data packet and each ACK packet gets a sequence
- number, which is increased with every packet sent.
- (This is a common space of sequence numbers)
-
- The algorithm which is performed upon the reception of a
- data packet [drbd_receiver].
-
- * If the sequence number of the data packet is higher than
- last_seq+1 sleep until last_seq+1 == seq_num(data packet)
- [needed to satisfy example case 7]
-
- 1. If the packet's sequence number is on the discard list,
- simply drop it.
- [ ex.c. 3]
- 2. Do we have a concurrent request? (i.e. Do I have a request
- to the same block in my transfer log.) If not -> write now.
- [ default ]
- 3. Have I already got an ACK packet for the concurrent
- request ? (Has the request the RQ_DRBD_SENT bit already set)
- If yes -> write the data from the data packet afterwards.
- [ ex.c. 6]
- 4. Do I have the "discard-concurrent-write-flag" ?
- If yes -> discard the data packet.
- If no -> Write data from the data packet afterwards and set
- the RQ_DRBD_SENT bit in the request object ( Since
- we will not get an ACK from our peer). Mark the
- ee to prepend the ACK packet with a discard info
- packet.
- [ ex.c. *]
-
- The algorithm which is performed upon the reception of an
- ACK packet [drbd_asender]
-
- * If we get an ACK, store the sequence number in last_seq.
-
- The algorithm which is performed upon the reception of an
- discard info packet [drbd_asender]
-
- * if the current last_seq is lower than the packet that should
- be discarded, store it in the to discard list.
-
- BTW, each time we have a concurrent write access, we print
- a warning to the syslog, since this indicates that the layer
- above us is broken!
-
- Note: In Item 6 we created a hash table over all requests in the
- transfer log, keyed with (sector & ~0x7). This allows us
- to find IO operations starting in the same 4k block of
- data quickly. -> With two lookups in the hash table we can
- find any concurrent access.
- 99% DONE
-
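The four numbered checks performed on reception of a data packet can be sketched as one decision function. This Python sketch is hypothetical and simplified: the names `Action` and `on_data_packet` are made up, the initial wait for last_seq+1 is left out, and the transfer-log lookup is reduced to a single argument that is None when there is no concurrent request and otherwise holds the state of its RQ_DRBD_SENT bit.

```python
from enum import Enum


class Action(Enum):
    DROP = "drop the data packet"
    WRITE_NOW = "write now"
    WRITE_AFTER = "write after the local request"
    WRITE_AFTER_SEND_DISCARD = "write after, set RQ_DRBD_SENT, send discard info"


def on_data_packet(seq, discard_list, concurrent_sent, have_discard_flag):
    """Decide what to do with an incoming data packet (steps 1 to 4)."""
    if seq in discard_list:              # step 1: peer told us to discard it
        return Action.DROP
    if concurrent_sent is None:          # step 2: no concurrent request
        return Action.WRITE_NOW
    if concurrent_sent:                  # step 3: ACK already received
        return Action.WRITE_AFTER
    if have_discard_flag:                # step 4: we hold the flag
        return Action.DROP
    return Action.WRITE_AFTER_SEND_DISCARD
```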
-10 Change Sync-groups to sync-after
-
- Sync groups turned out to be hard to configure in more
- complex setups, hard to implement right and, last but not least, they
- are not flexible enough to cover all real world scenarios.
-
- E.g. Two physical disks should be mirrored with DRBD. On one
- of the disks there is only a single partition, while the
- other one is divided into many (e.g. 4 smaller) partitions.
- One would want to sync the big one in parallel to the
- 4 small ones. While the resync process of the 4 small
- ones need to be serialized.
- -> With the current sync groups you can not express
- this requirement.
-
- Remove config options syncer { group <number>; }
- Introduce config options syncer { after <resource>; }
- 99% DONE
- Finished the implementation. Tested.
-
-11 Take into account that the two systems could have different
- PAGE_SIZE.
-
- At least we should negotiate the PAGE_SIZE used by the peers,
- and use it. In case the PAGE_SIZE is not the same inform
- the user about the fact.
-
- Probably a general high performance implementation for this
- issue is not necessary, since clusters of machines with
- different PAGE_SIZE are of academic interest only.
- 100% DONE by item 15
-
-12 Introduce a "common" section in the config file. Option
- section (like handlers, startup, disk, net and syncer)
- are inherited from the common section, if they are not
- defined in a resource section.
- 99% DONE
-
-13 Introduce an UUID (universally unique identifier) in the
- meta data. One purpose is to tag the bitmap with this UUID.
- If the peer's UUID is different to what we expect we know that
- we have to do a full sync....
- 99% DONE
- -> Will go out again, and be replaced by UUID for data
- generations. See item 16
-
-14 Sanitize ioctls to include a standard device information struct
- at the beginning, including the expected API version.
- Consider using DRBD ioctls with some char device similar to
- /dev/mapper/control
-
- The new interface is now based on netlink (actually connector).
- It is based on the concept of tag lists. The idea is that on the
- interface we pass lists (actually arrays) of tags. Where each
- tag identifies the following snippet of data.
- Each tag also states if it is mandatory.
-
- In case we have to add a new value to the interface, the
- existing userland tools continue to work with newer kernel
- modules and vice versa. (Only the older part of the two will
- inform the user with a warning, that there was an unknown
- tag on the interface, and that the unknown tag got ignored)
- But the basic functionality stays intact!
-
- While implementing this, we also implemented dynamic device allocation.
-
- drbdsetup is basically call compatible to its ioctl based
- ancestor, but has two new options:
-
- --create-device ___ create the device in case it does not exist yet.
- --set-defaults ____ set all not mentioned options to their default values.
-
- Things to do:
-
- * Locking in userspace, to prevent multiple instances of drbdsetup
- * Think about locking in kernel space ( device_mutex? )
-
- 80% DONE
-
-15 Accept BIOs bigger than one page, probably up to 32k (8 pages)
- currently.
- * Normal Requests. -> DONE
- * Make the syncer accumulate adjacent bits into bigger requests. -> DONE
- * Make the bitmap more coarse grained. -> TODO
- 66% DONE
-
-16 Displace the current generation-counters with a data-generation-UUID
- concept.
- The current generation counters have various weaknesses:
- * In a split-brained cluster the application of the same events
- to both cluster nodes could lead to equal generation-counters
- on both nodes, while the data is not in sync for sure.
- * They are completely unsuitable if a 3rd node is used for
- e.g. weekly snapshots.
- * Graceful takeover while disconnected not possible.
-
- We associate each data generation with a unique UUID (=64 bit random
- number). A new data generation is created if a primary node is
- disconnected from its secondary and when a degraded secondary
- becomes primary for the first time.
-
- In the meta-data we store a few generations-UUIDs:
- * current
- * bitmap
- * history[2]
-
- As well as the currently known flags:
- Consistent, WasUpToDate, LastState, ConnectedInd, WantFullSync
-
- When the cluster is in Connected state, then the bitmap gen-UUID
- is set to 0 (Since the Bitmap is empty). When we create a new current
- gen-UUID while we are disconnected the (old) current gets backed-up
- to the bitmap gen-UUID. (This allows us to identify the base
- of the bitmap later)
-
- Special UUID values:
- JustCreated [JC] ___ 4
-
- ALGORITHMS
-
- Upon Connect:
- self peer action
- 1. C=JC C=JC No Sync
- 2. C=JC C!=JC I am SyncTarget setting BM
- 3. C!=JC C=JC I am SyncSource setting BM
- 4. C = C Common power [off|failure](Examine the roles at crash time)
- 4.1 sec sec Common power off, no sync.
- 4.2 pri sec Common power failure, I am SyncSource using BM
- 4.3 sec pri Common power failure, I am SyncTarget using BM
- 4.4 pri pri Common power failure, resync in arbitrary direction.
- 5. C = B I am SyncTarget using BM
- 6. C = H1|H2 I am SyncTarget setting BM
- 7. B = C I am SyncSource using BM
- 8. H1|H2 = C I am SyncSource setting BM
- 9. B = B [ and B != 0 ] SplitBrain, try auto recover strategies.
- 10 H1|H2 = H1|H2 SplitBrain, disconnect.
- 11. Warn about unrelated Data, disconnect.
-
- Upon Disconnect:
- Primary:
- Copy the current-UUID over to the bitmap-UUID, create a new
- current-UUID.
- Secondary:
- Nothing to do.
-
- Upon becoming Primary:
- In case we are disconnected and the bitmap-UUID is empty, copy the
- current-UUID over to the bitmap-UUID and create a new current-UUID.
- Special-case: primary with --do-what-I-say, clearing the inconsistent
- flag causes a new UUID to be generated.
-
- Upon start of resync:
- Clear the consistent-flag on the SyncTarget. Generate a new UUID for
- the bitmap-UUID of the SyncSource and the current-UUID of the SyncTarget.
-
- Upon finish of resync:
- Set the bitmap-UUID to 0. The SyncTarget adopts the current-UUID
- of the SyncSource, and sets its consistent-flag.
-
- When the bitmap-UUID gets cleared, move the previous value to H1.
- In case H1 was already set copy its previous value to H2. Etc..
-
- For the auto recover strategies after split brain (see item 5)
- it is necessary to embed the node's role into the UUIDs.
- This is masked out of course when the UUIDs are compared.
-
- * Note1: Discontinue the --human and --timeout-expired options when
- becoming primary.
- NB: If they are needed, I think they can be implemented
- as special UUID values.
-
- 99% DONE. Kernel part is implemented, userland parts are implemented,
- --human and --timeout-expired are removed.
- Everything seems to work so far.
-
- Known issues: we have to define behaviour for two-primaries case,
- and for connection loss when Primary with local disk != UpToDate.
-
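The "Upon Connect" table of item 16 reads as a first-match rule list, which the following Python sketch walks through in order. It is an interpretation, not the implementation: the names `Uuids` and `on_connect` are hypothetical, the left column is taken to be the local node and the right column the peer, and treating 0 as "unset" for bitmap and history UUIDs is an assumption.

```python
from dataclasses import dataclass

JUST_CREATED = 4  # special UUID value [JC]


@dataclass
class Uuids:
    current: int
    bitmap: int = 0
    history: tuple = (0, 0)      # H1, H2
    was_primary: bool = False    # role at crash time, used by rule 4


def _hist(u):
    # History slots holding 0 are considered unset (assumption).
    return {h for h in u.history if h}


def on_connect(me, peer):
    """Return the action for the first matching rule of the table."""
    if me.current == JUST_CREATED and peer.current == JUST_CREATED:
        return "no sync"                                        # rule 1
    if me.current == JUST_CREATED:
        return "SyncTarget setting BM"                          # rule 2
    if peer.current == JUST_CREATED:
        return "SyncSource setting BM"                          # rule 3
    if me.current == peer.current:                              # rule 4
        if me.was_primary and peer.was_primary:
            return "resync in arbitrary direction"
        if me.was_primary:
            return "SyncSource using BM"
        if peer.was_primary:
            return "SyncTarget using BM"
        return "no sync"
    if me.current == peer.bitmap:
        return "SyncTarget using BM"                            # rule 5
    if me.current in _hist(peer):
        return "SyncTarget setting BM"                          # rule 6
    if me.bitmap == peer.current:
        return "SyncSource using BM"                            # rule 7
    if peer.current in _hist(me):
        return "SyncSource setting BM"                          # rule 8
    if me.bitmap and me.bitmap == peer.bitmap:
        return "split brain, try auto recover strategies"       # rule 9
    if _hist(me) & _hist(peer):
        return "split brain, disconnect"                        # rule 10
    return "unrelated data, disconnect"                         # rule 11
```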
-17 Something like
-
- drbdx: WARNING disk sizes more than 10% different
-
- would be nice at (initial) full sync.
- drbdx: WARNING disk sizes more than 10% different
-
-18 Connection-Teardown Packet. Currently the new state-checks
- disallow "drbdadm disconnect res" on the primary node of a
- connected cluster.
- The Teardown Packet causes the secondary node to outdate
- its data and to close the connection in one go.
- 99% DONE.
-
-19 Make the updates to the bitmap transactional. Esp for resizing.
- Make updates to the superblock transactional
-
-20 There are quite a number of parameters that must be set equal
- (or some reciprocal) on the two nodes. We need to ensure that
- the config is valid, from a viewpoint of the whole cluster.
- E.g.
- protocol equal
- after-sb-0pri / discard-local/remote equal / reciprocal
- after-sb-1pri equal
- after-sb-2pri equal
- want_lose reciprocal
- two_primaries equal
- 99% DONE
-
-21 Write barriers in the kernel
- In Linux-2.6 write barriers in the block-io layer are represented as
- REQ_SOFTBARRIER, REQ_HARDBARRIER and REQ_NOMERGE flags on requests.
- In the BIO layer this is BIO_RW_BARRIER, which is usually set on
- BIO_RW (=write) requests.
-
- The REQ_HARDBARRIER bit is currently used to do a cache flush on
- IDE devices. Actually not all IDE devices can do cache flushes, there
- are some older models out there that can do write-caching but can
- not perform a cache flush!
-
- Journaling file systems should use this barrier mechanism in their journal
- writes (actually on the commit block, this is the last write in a
- transactional update to the journal).
-
- As for DRBD we should probably ship the REQ_HARDBARRIER flags with
- our wire protocol (or should they be expressed by Barrier packets?)
-
- We will only see such REQ_HARDBARRIER flags if we state to the upper layers
- that we are able to deal with them. We need to do this by announcing it:
- blk_queue_ordered(q, QUEUE_ORDERED_FLUSH or QUEUE_ORDERED_TAG ) .
- Default is QUEUE_ORDERED_NONE. This is the reason why we never see
- the REQ_HARDBARRIER flag currently.
-
- Another consequence of this is, that IDE devices that do _not_ support
- cache flushes and have write cache enabled are inherently buggy to use with
- a journaled file system.
-
- SCSI's Tagged queuing (seems to be present in SATA as well)
- [excerpt from http://www.scsimechanic.com/scsi/SCSI2-07.html]
-
- Tagged queuing allows a target to accept multiple I/O processes from
- the same or different initiators until the logical unit's command queue
- is full.
-
- If only SIMPLE QUEUE TAG messages are used, the target may execute the
- commands in any order that is deemed desirable within the constraints
- of the queue management algorithm specified in the control mode page
- (see 8.3.3.1).
-
- If ORDERED QUEUE TAG messages are used, the target shall execute the
- commands in the order received with respect to other commands received
- with ORDERED QUEUE TAG messages. All commands received with a SIMPLE
- QUEUE TAG message prior to a command received with an ORDERED QUEUE
- TAG message, regardless of initiator, shall be executed before that
- command with the ORDERED QUEUE TAG message. All commands received with
- a SIMPLE QUEUE TAG message after a command received with an ORDERED
- QUEUE TAG message, regardless of initiator, shall be executed after
- that command with the ORDERED QUEUE TAG message.
-
- A command received with a HEAD OF QUEUE TAG message is placed first in
- the queue, to be executed next. A command received with a HEAD OF
- QUEUE TAG message shall be executed prior to any queued I/O
- process. Consecutive commands received with HEAD OF QUEUE TAG messages
- are executed in a last- in-first-out order.
-
- I think in the context of SCSI the kernel usually issues write requests
- with the SIMPLE QUEUE TAG, and requests with the REQ_HARDBARRIER
- (i.e. bio's with the BIO_RW_BARRIER) with an ORDERED QUEUE TAG.
-
- What QUEUE_ORDERED_ type should we expose ?
-
- In order to support capable IDE devices right, we should ship the
- BIO_RW_BARRIER bit with our data packets in case the peer's backing
- storage is of the QUEUE_ORDERED_FLUSH type.
-
- If both devices are of the QUEUE_ORDERED_TAG type we should also claim
- to be of that type, and ship the BIO_RW_BARRIER bit as well.
-
- self peer DRBD
- ---------------------
- NONE , NONE => NONE
- NONE , FLUSH => NONE
- NONE , TAG => NONE
- FLUSH, NONE => NONE
- FLUSH, FLUSH => FLUSH
- FLUSH, TAG => FLUSH
- TAG, NONE => NONE
- TAG, FLUSH => FLUSH
- TAG, TAG => TAG
-
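The nine rows of the table above reduce to "take the weaker of the two types" when the types are ordered NONE < FLUSH < TAG. A minimal Python sketch of that reading, with the hypothetical helper name `negotiated_type`:

```python
# Ordered from weakest to strongest ordering guarantee.
ORDER = ("NONE", "FLUSH", "TAG")


def negotiated_type(self_type, peer_type):
    """Return the QUEUE_ORDERED_ type DRBD would expose for this pair."""
    return min(self_type, peer_type, key=ORDER.index)
```

For example, negotiated_type("TAG", "FLUSH") gives "FLUSH", matching the second-to-last row.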
- How should we deal with our self generated barrier packets ?
-
- In case our backing device is of the QUEUE_ORDERED_NONE class, we
- have to stay with the current code.
-
- In case our backing device only supports QUEUE_ORDERED_FLUSH we
- will use the current code. That means, when we receive a write
- barrier packet we wait until all of our pending local write
- requests are done. (This potentially causes congestion on the TCP
- socket...)
-
- In case our backing device's queue properties are set to
- QUEUE_ORDERED_TAG we offload the complete barrier logic to the
- backing storage device:
-
- * When we receive a barrier packet
- - If we have no local pending requests, we send the barrier ACK
- immediately. (= current code)
- - If the last_barrier_write member of mdev points to an epoch_entry
- we set bit 31 of bnum.
- - If we have local pending requests, we set a flag that the next
- data packet has to be written with the BIO_RW_BARRIER flag.
- (That flag should be called BARRIER_NEEDED)
-
- * When receiving data packets we test_and_clear BARRIER_NEEDED,
- and set the BIO_RW_BARRIER on the write request. We also set
- the last_barrier_write member of mdev.
- [Normal writes clear the last_barrier_write member of mdev]
-
- * When a write completes and it has the bnum set, send the barrier
- ack before sending the ack for the write. In case the highest
- bit of bnum is set as well, also send the barrier ack following
- the write ack of the data packet.
-
- 90% DONE [ Not tested yet. ]
-
-22 Reboot notifier.
-
-23 External imposed SyncPause states.
- There are two new commands: 'drbdadm pause-sync res'
- 'drbdadm resume-sync res'
- These may be used to suspend the resynchronisation process while
- e.g. the backing storages' raid controller does its resynchronisation.
-
- While implementing this, I also made sure that in a 3 node
- setup the two peers of a connection will agree if a resynchronisation
- is paused under all conditions you can think of, if there are more
- than two nodes!
-
- 99% DONE
-
-24 Make it possible to hot-add disk drives == Atomic configuration changes.
-
- 99% DONE
-
-25 Add reserved fields to DRBD-meta-data, add a bytes per bit field to
- metadata.
-
- 99% DONE
-
-26 Implement a kind of "dstate" command to make integration with
- Heartbeat-2.0's master/slave-support possible.
-
- 99% DONE
-
-27 Remove all explicit drbd_md_write() calls, and create a mechanism,
- that always keeps the on disk-metadata up-to-date implicitly.
- Calling drbd_md_write() explicitly is too error-prone.
-
- 99% DONE
-
-28 Implement a kind of 'call home', a single HTTP get request, that
- gets counted in a data base. The initiator calculates a simple
- hash over the machine and resource names. Each time a meta-data
- set gets generated, the 'call home' is initiated. The user might
- of course opt out of this.
-
- 99% DONE
-
-29 Give drbdadm a 'hidden-commands' command that also shows
- the hidden sub-commands in the usage.
-
- 99% DONE
-
-30 The current drbdadm_scanner is 1MB in source and as binary.
- Use a _basic_ flex scanner, and a hand written parser for superb
- error reporting.
-
- 99% DONE
-
-31 Resizing several GB results in ko-count timeouts, maybe since the
- secondary node does the enlargement of the bitmap in the receiver (?)
-
- DONE, by using the async bitmap IO code.
-
-32 drbdmeta: with internal meta-data v07 and v08 meta-data super blocks
- are in different places. -> It is possible to have v07 AND v08 meta
- data on one device.
- => drbdmeta should make sure that it overwrites the other location
- in case it creates a meta-data block.
-
- 99% DONE
-
-33 Serialize state changes like secondary -> primary and
- Connected -> SyncSource in the cluster.
-
- role <- primary
- conn <- StartingSyncT (disk <- inconsistent)
- conn <- StartingSyncS (pdsk <- inconsistent)
- disk <- Diskless (as long as it happens as administrative command)
- pdsk <- Outdated (= a 'disconnect' issued on a primary node)
-
- * When a state change might sleep ( request_state() ) and it is
- to be cluster wide atomic ( pre_state_checks() determines this!).
- 1. Acquire the cluster state change lock (bit & waitqueue) ?
- 2. We send a request_state packet.
-
- * When a request_state packet is received
-
- 1. * If we are UNIQUE we take the cluster lock (potentially
- waiting for it) and try to apply the remote's request
- as soon as we have the lock.
- * When we are not UNIQUE we try to apply the state change
- immediately (without taking the cluster lock).
- 2. We send the ACK / NACK.
- ( Do we actually need an ACK/NACK ?
- * On the not UNIQUE side, we will fail the request as
- soon as the offending state request comes in.
- * On the UNIQUE side we need a positive ACK to
- continue.
- ) I guess for the sake of completeness, we should
- have both packets, although currently the need for
- the NACK packet is not obvious.
-
- * When we receive an ACK / NACK we either successfully finish or
- fail the request_state() call. (Error codes should be passed
- from the peer.)
-
- * When the connection fails ( = actually a non-cluster wide state
- change happens while a cluster wide state change goes on), we
- need to re-evaluate the pre state change check. In case the
- pre state change check allows the new state we can proceed,
- otherwise we need to fail the request.
-
- * How to do the synchronisation from the receipt of the ACK / NACK
- packet to the termination of the request_state() function ?
- * wait_queue & bit.
-
- DATA STRUCTURES:
- * A CLUSTER_STATE_CHANGE bit == the cluster lock bit.
- * A CL_ST_CHG_SUCCESS bit set by the receiver.
- * A CL_ST_CHG_FAIL bit set by the receiver.
- * A wait queue.
-
- TODOS:
- Evaluate if it is possible to use it for starting resync. (invalidate)
- Evaluate it for the other cases...
-
- 90 % Is implemented. Changing the role to primary already uses this
- mechanism. Starting resync with invalidate and invalidate_remote
- now also uses this method. Detaching now also uses this mechanism.
-
-34 Improve the initial hand-shake, to identify the sockets (and TCP-
- links) by an initial message, and not only by the connection timing.
-
- 99% DONE
-
-35 Bigger AL-extents (e.g. 16MB)
-
-36 Increase the number of UUID history slots.
-
-37 In case heartbeat (or some one else) makes us primary, we need to
- check first if the peer is alive.
- Currently we have a problem when heartbeat's dead time is smaller
- than DRBD's network timeout.
-
-38 Create another on-io-error handler that retries failed read
- operations on the peer, but does not detach from the local disk.
- It also marks that block in the bitmap as out-of-date.
-
- Simon works on this.
-
-39 Send mirrored write requests out of the worker context.
- 99% DONE
-
-40 Do something with FLUSHBUFS ioctl.
-
-41 Fix DRBD's behaviour in case of a common power failure and when
- both nodes were in primary role.
-
- See the algorithm of Item 16, sections 4 to 4.4.
-
- Further we need to have the resync roles conflict "rr-conflict"
- strategy option with the following values:
-
- The available options are:
- "disconnect" ... No automatic resynchronisation, simply disconnect.
- "violently" .... Sync to the primary node is allowed, violating the
- assumption that data on a block device is stable
- for one of the nodes. DANGEROUS, DO NOT USE.
- "call-pri-lost"
- Call this helper program on one of the machines.
- This program is expected to halt or reboot the
- machine.
-
- An exception of course is a primary disk-less node that gets a disk
- attached. Such a node becomes sync target, but since it does not
- show a violent data change, this state transition is always allowed.
-
- 99% DONE
-
-42 Forward port the ability to resume the TL after IO was frozen,
- in case the connection is reestablished again.
-
-43 Fix indexed meta-data.
-
-44 Callbacks to userspace should run asynchronously.
-
-Maybe:
-
-* Switch to protocol C in case we are running without a local
- disk and are configured to use protocol A or B.
-
-* Dynamic misc char device instead of IOCTLs for configuration. Evaluate
- if the configuration could be done over a netlink socket as well...
-
-* A netlink socket to communicate events to userspace.
- - All state changes
- - the need to outdate the peer
-
-* Write some heartbeat glue to do a graceful switchover in case of
- a local IO failure. (requires the netlink socket thing)
-
-plus-branches:
-----------------------
-
-1 Make use-csums to use the kernel's crypto API
-
-2 Implement online verification
-
-3 Change the bitmap code to work with unmapped highmem pages, instead
- of using vmalloc()ed memory. This allows users of 32bit platforms
- to use drbd on big devices (in the ~3TB range)
-
-4 3 node support. Do and test a 3 node setup (2nd DRBD stacked over
- a DRBD pair). Enhance the user level tools to support the 3 node
- setup.
-
-5 Have protocol version 74 available in drbd-0.8, to allow rolling
- upgrades
-
+5. fix [or discuss away ;-)] anything else brought up on lkml
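Item 33 of the removed roadmap text lists three flag bits and a wait queue as the data structures for cluster-wide state changes. The following is a minimal single-threaded sketch of how those bits could interact. It is not DRBD's actual code: the struct, the function names and the return values are hypothetical, and the wait queue is replaced by a "still pending" return value.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names follow the roadmap text; the bit values are assumptions. */
enum {
	CLUSTER_STATE_CHANGE = 1UL << 0,	/* the cluster lock bit */
	CL_ST_CHG_SUCCESS    = 1UL << 1,	/* set by the receiver on ACK */
	CL_ST_CHG_FAIL       = 1UL << 2,	/* set by the receiver on NACK */
};

/* Hypothetical container; the roadmap does not say where the bits live. */
struct node_flags {
	unsigned long bits;
};

/* Requester, step 1: take the cluster lock if it is free.
 * Real code would sleep on the wait queue instead of returning false. */
static bool cluster_lock_try(struct node_flags *f)
{
	if (f->bits & CLUSTER_STATE_CHANGE)
		return false;
	f->bits |= CLUSTER_STATE_CHANGE;
	return true;
}

/* Receiver: record the peer's ACK or NACK for the pending request. */
static void record_reply(struct node_flags *f, bool ack)
{
	assert(f->bits & CLUSTER_STATE_CHANGE);	/* a request must be pending */
	f->bits |= ack ? CL_ST_CHG_SUCCESS : CL_ST_CHG_FAIL;
}

/* Requester: finish the request once a reply has arrived.
 * Returns 0 on ACK, -1 on NACK, 1 if no reply has been recorded yet.
 * On ACK or NACK all three bits are cleared, releasing the lock. */
static int finish_request(struct node_flags *f)
{
	int rv;

	if (f->bits & CL_ST_CHG_SUCCESS)
		rv = 0;
	else if (f->bits & CL_ST_CHG_FAIL)
		rv = -1;
	else
		return 1;

	f->bits &= ~(unsigned long)(CLUSTER_STATE_CHANGE |
				    CL_ST_CHG_SUCCESS | CL_ST_CHG_FAIL);
	return rv;
}
```

A requester would call cluster_lock_try(), send the request_state packet, and then poll (or, in real code, sleep) until finish_request() stops returning 1.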
Modified: branches/drbd-8.2/user/drbdmeta.c
===================================================================
--- branches/drbd-8.2/user/drbdmeta.c 2007-09-07 10:13:21 UTC (rev 3059)
+++ branches/drbd-8.2/user/drbdmeta.c 2007-09-07 10:15:46 UTC (rev 3060)
@@ -273,9 +273,9 @@
struct md_cpu md;
/* _byte_ offsets of our "super block" and other data, within fd */
- u64 md_offset;
- u64 al_offset;
- u64 bm_offset;
+ s64 md_offset;
+ s64 al_offset;
+ s64 bm_offset;
size_t md_mmaped_length;
size_t al_mmaped_length;
size_t bm_mmaped_length;
@@ -684,14 +684,14 @@
int v07_parse(struct format *cfg, char **argv, int argc, int *ai);
int v07_md_initialize(struct format *cfg);
void v07_md_erase_others(struct format *cfg);
-u64 v07_md_get_byte_offset(struct format * cfg);
+s64 v07_md_get_byte_offset(struct format * cfg);
int v08_md_open(struct format *cfg);
int v08_md_cpu_to_disk(struct format *cfg);
int v08_md_disk_to_cpu(struct format *cfg);
int v08_md_initialize(struct format *cfg);
void v08_md_erase_others(struct format *cfg);
-u64 v08_md_get_byte_offset(struct format * cfg);
+s64 v08_md_get_byte_offset(struct format * cfg);
struct format_ops f_ops[] = {
[Drbd_06] = {
@@ -881,7 +881,7 @@
}
int v07_style_md_open(struct format *cfg,
- u64 (*md_get_byte_offset) (struct format *),
+ s64 (*md_get_byte_offset) (struct format *),
size_t size)
{
struct stat sb;
@@ -905,7 +905,7 @@
exit(20);
}
- if (ioctl(cfg->md_fd, BLKFLSBUF) == -1) {
+ if (ioctl(cfg->md_fd, BLKFLSBUF, NULL) == -1) {
PERROR("WARN: ioctl(,BLKFLSBUF,) failed");
}
@@ -938,7 +938,7 @@
// For the case that someone modified la_sect by hand..
if( (cfg->md_index == DRBD_MD_INDEX_INTERNAL ||
cfg->md_index == DRBD_MD_INDEX_FLEX_INT ) &&
- (cfg->md.la_sect*512 > cfg->md_offset) ) {
+ (cfg->md.la_sect*512 > (u64)cfg->md_offset) ) {
printf("la-size-sect was too big, fixed.\n");
cfg->md.la_sect = cfg->md_offset/512;
}
@@ -972,7 +972,7 @@
}
void md_erase_sb(struct format *cfg,
- u64 (*md_get_byte_offset) (struct format *))
+ s64 (*md_get_byte_offset) (struct format *))
{
/* in case these are internal meta data, we need to
make sure that there is no v08 superblock at the end
@@ -980,7 +980,7 @@
unsigned char zero_sector[512];
struct format cfg_f;
- u64 offset;
+ s64 offset;
int bw;
if(cfg->md_index == DRBD_MD_INDEX_INTERNAL ||
@@ -992,6 +992,8 @@
in the front of the meta data area. */
offset = md_get_byte_offset(&cfg_f);
+ if (offset < 0)
+ return;
if(lseek64(cfg->md_fd, offset, SEEK_SET) == -1) {
PERROR("lseek64() failed");
exit(20);
@@ -1404,9 +1406,9 @@
begin of v07 {{{
******************************************/
-u64 v07_md_get_byte_offset(struct format *cfg)
+s64 v07_md_get_byte_offset(struct format *cfg)
{
- u64 offset;
+ s64 offset;
switch(cfg->md_index) {
default: /* external, some index */
@@ -1509,7 +1511,7 @@
PERROR("fsync() failed");
err = -1;
}
- if (ioctl(cfg->md_fd, BLKFLSBUF) == -1) {
+ if (ioctl(cfg->md_fd, BLKFLSBUF, NULL) == -1) {
PERROR("ioctl(,BLKFLSBUF,) failed");
err = -1;
}
@@ -1545,9 +1547,9 @@
begin of v08 {{{
******************************************/
-u64 v08_md_get_byte_offset(struct format *cfg)
+s64 v08_md_get_byte_offset(struct format *cfg)
{
- u64 offset;
+ s64 offset;
switch(cfg->md_index) {
default: /* external, some index */
Modified: branches/drbd-8.2/user/drbdmeta_unfinished_rewrite.c
===================================================================
--- branches/drbd-8.2/user/drbdmeta_unfinished_rewrite.c 2007-09-07 10:13:21 UTC (rev 3059)
+++ branches/drbd-8.2/user/drbdmeta_unfinished_rewrite.c 2007-09-07 10:15:46 UTC (rev 3060)
@@ -2028,7 +2028,7 @@
DRBD_MD_INDEX_FLEX_INT, cfg->bd_size);
printf("%lld\n%lld\n%lld\n", cfg->bd_size, fixed_offset, flex_offset);
- if (fixed_offset < (off_t)cfg->bd_size - 4096) {
+ if (0 <= fixed_offset && fixed_offset < (off_t)cfg->bd_size - 4096) {
/* ... v07 fixed-size internal meta data? */
PREAD(cfg->md_fd, on_disk_buffer, 4096, fixed_offset);
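The drbdmeta.c hunks switch the offset type from u64 to s64 and add checks such as "if (offset < 0)". A plausible reading is that a computed meta-data offset can come out negative, for example on a device smaller than the meta-data area, and that an unsigned type hides this by wrapping around. The sketch below illustrates that effect only; the helper names and the offset formula are hypothetical and are not taken from drbdmeta.c.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t s64;
typedef uint64_t u64;

/* Hypothetical: offset of an area of md_size bytes placed at the end of a
 * device of bd_size bytes, computed in unsigned arithmetic.
 * Wraps to a huge value when md_size > bd_size. */
static u64 md_offset_unsigned(u64 bd_size, u64 md_size)
{
	return bd_size - md_size;
}

/* Same computation in signed arithmetic: the result is negative when the
 * device is too small, so callers can test for "offset < 0".
 * Assumes both sizes fit in the positive range of s64. */
static s64 md_offset_signed(u64 bd_size, u64 md_size)
{
	return (s64)bd_size - (s64)md_size;
}

/* Mirrors the shape of the added range check: usable only if the offset
 * is non-negative and lies inside the device. */
static int offset_is_usable(s64 offset, u64 bd_size)
{
	return offset >= 0 && (u64)offset < bd_size;
}

/* Convert a validated offset for use in a seek call. */
static u64 to_seek_offset(s64 offset)
{
	assert(offset >= 0);	/* callers must have checked the offset first */
	return (u64)offset;
}
```

With the unsigned variant a test like "offset < 0" is always false, which is presumably why the commit changes the type before adding the checks.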