[Drbd-dev] GFS support in DRBD-0.8

Philipp Reisner philipp.reisner at linbit.com
Wed Sep 22 15:18:45 CEST 2004


[...]
> >   Proposed Solution 1, using the order of a coordinator node:
> >
> >   Writes from the coordinator node are carried out, as they are
> >   carried out on the primary node in conventional DRBD. ( Write
> >   to disk and send to peer simultaneously. )
> >
> >   Writes from the other node are sent to the coordinator first,
> >   then the coordinator inserts a small "write now" packet into
> >   its stream of write packets.
> >   The node commits the write to its local IO subsystem as soon
> >   as it gets the "write-now" packet from the coordinator.
> >
> >   Note: With protocol C it does not matter which node is the
> >         coordinator from the performance viewpoint.
> >
> >   Proposed Solution 2, use ALs as distributed locks:
> >
> >   Only one node might mark an extent as active at a time. New
> >   packets are introduced to request the locking of an extent.
> > --snap--
> >
> > PS: I think that we do not need to use the AL extents as
> >     distributed locks.
>
> we don't need to, and it will probably be simpler to implement with S1.
> but S2 will most likely scale better as soon as we introduce more than
> two nodes, and maybe already with only two nodes, since I expect GFS
> and similar systems to coordinate on the higher level already, so that
> typically (think of for example the per-node-journals) there won't be
> real concurrent access to the same area of the device.
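
To make the S1 ordering concrete, here is a minimal Python sketch, not DRBD code. The function name run_s1, the node labels "coord" and "peer" and the tuple packet format are made up for illustration. The only things taken from the proposal are that the coordinator's stream defines the order, and that the other node holds its own write until the matching "write-now" packet shows up in that stream.

```python
def run_s1(arrivals):
    """Replay writes in the order the coordinator sees them.

    arrivals: list of (origin, write_id); origin is "coord" or "peer"
    (hypothetical labels). Returns the order in which each node
    commits the writes to its local disk.
    """
    coord_disk = []       # commit order on the coordinator
    stream = []           # packets the coordinator sends to the peer
    peer_pending = set()  # peer writes still waiting for "write-now"

    for origin, write_id in arrivals:
        coord_disk.append(write_id)
        if origin == "coord":
            # Coordinator write: the full data packet goes to the peer.
            stream.append(("data", write_id))
        else:
            # Peer write: the peer already has the data, so a small
            # "write-now" marker in the stream is enough.
            peer_pending.add(write_id)
            stream.append(("write-now", write_id))

    peer_disk = []
    for kind, write_id in stream:
        if kind == "write-now":
            # The peer may now commit its own held-back write.
            peer_pending.remove(write_id)
        peer_disk.append(write_id)

    return coord_disk, peer_disk


coord, peer = run_s1([("coord", "A"), ("peer", "B"), ("coord", "C")])
print(coord == peer)
```

Both nodes end up with the same commit order because the peer never commits anything, not even its own writes, before the coordinator's stream tells it to.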

DRBD-0.8 will strictly be 2 nodes. For the two-node case S1 and S2
have in principle the same latency with protocol C
 (see the attached PDF, N2 initiates the write, ... the path until
  IO completion can be signalled is equally long.)

With S2 we have one packet less travelling over the wire per write
request, thus fewer interrupts, less CPU load etc... more performance
in real life.

But with S2 an extent ping-pong will be *really* expensive.
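
A rough way to see both effects is to count packets per write. This is a sketch under assumptions, not something taken from the attached PDF: it assumes that with protocol C every write is answered by one "ack", that S1 needs the extra "write-now" only for writes from the non-coordinator node, and that taking over an extent lock in S2 costs one request and one grant packet. All function and packet names are invented for the example.

```python
def s1_packets(from_coordinator):
    """Wire packets for one write under S1 with protocol C (assumed)."""
    if from_coordinator:
        return ["data", "ack"]
    # Data goes to the coordinator, which answers with "write-now"
    # and later acknowledges its own disk write.
    return ["data", "write-now", "ack"]


def s2_packets(holds_lock):
    """Wire packets for one write under S2 with protocol C (assumed)."""
    lock_traffic = [] if holds_lock else ["lock-request", "lock-grant"]
    return lock_traffic + ["data", "ack"]


def steady_s2(writes):
    """Total packets when one node keeps the extent lock."""
    return sum(len(s2_packets(holds_lock=True)) for _ in range(writes))


def ping_pong_s2(writes):
    """Total packets when two nodes alternate writes to ONE extent."""
    # Every write finds the lock on the other node.
    return sum(len(s2_packets(holds_lock=False)) for _ in range(writes))


# Packets saved per non-coordinator write when the lock is already held.
print(len(s1_packets(False)) - len(s2_packets(True)))
# Ten writes with a stable lock versus ten writes in ping-pong mode.
print(steady_s2(10), ping_pong_s2(10))
```

Under these assumptions the packet count doubles in the ping-pong case, and each write additionally has to wait for the lock hand-over before its data can be sent.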

PS: You mentioned that you want to use another term for
    extent. Why ? The expression extent is used in LVM1 for
    the smallest unit of allocation, by default 4M.
    I think it is a good term for what we mean...

OK, let's consider S2:
Why is it a good idea to unify the AL-extents and the lock-extents ?

pro: we already have AL-extents.
con: it is another thing!

I think it would be wise to have an independent LRU cache for lock-extents

pro: other extent sizes possible.
pro: other cache sizes possible.
pro: deletion from cache (other node needs that extent) is cheap! no
     meta-data update.
con: more code. (but LRU is already nicely abstracted anyway)
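
A minimal sketch of such a separate lock-extent cache, in Python rather than kernel C, and not based on DRBD's existing LRU code. The class name LockExtentCache, its methods and both constructor parameters are hypothetical. It only illustrates the three "pro" points: its own extent size, its own cache size, and giving up an extent as a pure in-memory operation.

```python
from collections import OrderedDict


class LockExtentCache:
    """LRU set of the extents this node currently holds the lock for."""

    def __init__(self, max_extents, extent_size):
        self.max_extents = max_extents  # cache size, independent of the AL
        self.extent_size = extent_size  # bytes, independent of AL extents
        self._held = OrderedDict()      # extent number -> None, oldest first

    def extent_of(self, offset):
        """Map a byte offset on the device to its lock-extent number."""
        return offset // self.extent_size

    def touch(self, offset):
        """Mark the extent covering `offset` as held and recently used.

        Returns the extent number evicted to make room, or None.
        """
        ext = self.extent_of(offset)
        if ext in self._held:
            self._held.move_to_end(ext)
            return None
        evicted = None
        if len(self._held) >= self.max_extents:
            evicted, _ = self._held.popitem(last=False)
        self._held[ext] = None
        return evicted

    def give_up(self, ext):
        """The other node needs this extent: drop it from the cache.

        Only memory is touched here, there is no meta-data write.
        Returns True if the extent was actually held.
        """
        if ext in self._held:
            del self._held[ext]
            return True
        return False


cache = LockExtentCache(max_extents=2, extent_size=4 * 1024 * 1024)
cache.touch(0)
cache.touch(4 * 1024 * 1024)
print(cache.touch(8 * 1024 * 1024))  # extent pushed out of the cache
```

In a real implementation the eviction path would also have to tell the peer that the lock is free again; that part is left out here.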

I am willing to agree on S2 as soon as I know that it will fit
GFS's usage pattern. I tried to find a paper on the on-disk
layout of GFS, but a 30 minute search was not successful....

> note that I think either way we need to get rid of the current scheme of
> "throttling" io in the tcp buffer by doing all network and disk io
> directly in the process context of the submitting process. we should
> instead have our own queue, with some maximum length, and let the worker
> do the work. yes this introduces more context switches.  but I really
> doubt that this is a performance problem on today's boxes.

Tell me one reason for this other than "I think we need..."
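
For reference, the arrangement proposed in the quoted paragraph (own queue with a maximum length, a worker doing the IO) could look roughly like this user-space Python sketch. It is not how DRBD does or would do it in the kernel. MAX_QUEUE_LEN, worker and submit_all are invented names, and the list append stands in for the real disk and network IO.

```python
import queue
import threading

MAX_QUEUE_LEN = 4  # hypothetical limit on outstanding requests


def worker(q, done):
    """Take requests off the queue and 'process' them in order."""
    while True:
        req = q.get()
        if req is None:    # sentinel: no more work
            break
        done.append(req)   # stand-in for disk write + network send


def submit_all(requests):
    """Submit requests through a bounded queue served by one worker."""
    q = queue.Queue(maxsize=MAX_QUEUE_LEN)
    done = []
    t = threading.Thread(target=worker, args=(q, done))
    t.start()
    for req in requests:
        # put() blocks while the queue is full, so the submitter is
        # throttled by the queue length and not by the TCP buffer.
        q.put(req)
    q.put(None)
    t.join()
    return done


print(submit_all(list(range(10))) == list(range(10)))
```

The extra context switch per request mentioned in the quote is the hand-over from the submitting thread to the worker thread.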

>
> I'd like to keep Primary, but introduce "active" as well, so we can have
> active Secondaries. a Primary is by definition always active.
>

So it would be Primary/Active ?? What is the difference between
an Active and a Primary node ?

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GFS-mode-options.pdf
Type: application/pdf
Size: 9808 bytes
Desc: not available
Url : http://lists.linbit.com/pipermail/drbd-dev/attachments/20040922/7b091b45/GFS-mode-options.pdf

