[Drbd-dev] Protocol A,B & submit ee failure

Shriram Rajagopalan rshriram at gmail.com
Wed Nov 17 23:44:38 CET 2010


On Wed, Nov 17, 2010 at 12:27 PM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:

> On Wed, Nov 17, 2010 at 11:00:03AM -0800, Shriram Rajagopalan wrote:
> > [I apologize if this is a double post]
>
> It's not.
> drbd-dev just happens to be moderated somewhat ;-)
>
> > Hi all,
> > I have recently started hacking into drbd kernel code
>
> Just curious: Why?
> What are you up to?
>
Well, there is this 4th protocol I want to try:
Protocol A + deferred writes at the backup until P_BARRIER
(i.e. buffer the writes in the current epoch until an explicit
P_BARRIER arrives).

Currently, receive_Barrier only blocks on the active_ee list, waiting
for completion of pending IOs. I want to issue the deferred writes in
receive_Barrier and then fall through to the rest of the code.
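
Roughly, this is what I have in mind (a sketch only: the deferred_ee
list is my own addition, and I am going by the 8.3 sources for the
drbd_submit_ee signature, so correct me if I have it wrong):

    /* in receive_Barrier(), before the existing wait on active_ee:
     * submit everything that was deferred during this epoch */
    static int flush_deferred_ee(struct drbd_conf *mdev)
    {
            struct drbd_epoch_entry *e, *t;
            LIST_HEAD(writes);

            spin_lock_irq(&mdev->req_lock);
            list_splice_init(&mdev->deferred_ee, &writes);
            spin_unlock_irq(&mdev->req_lock);

            list_for_each_entry_safe(e, t, &writes, w.list) {
                    spin_lock_irq(&mdev->req_lock);
                    list_move_tail(&e->w.list, &mdev->active_ee);
                    spin_unlock_irq(&mdev->req_lock);
                    /* same call receive_Data() makes today */
                    if (drbd_submit_ee(mdev, e, WRITE, DRBD_FAULT_DT_WR) != 0)
                            return 0; /* the failure case this mail is about */
            }
            return 1;
    }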

Consider an app that is moderately IO intensive and needs a
transactional view of the disk to survive failures. It "buffers" any
network output until the next P_BARRIER, and it issues a P_BARRIER
every T milliseconds, where T = 50 ms, 100 ms, whatever.

Since a barrier ack implies that both disks are in sync, the app at
the primary will send network output only after receiving the barrier
ack.
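
(To make the ordering concrete, the app side would look roughly like
this; every helper name here is hypothetical:)

    /* hypothetical app-side epoch loop: output from epoch N is
     * released only once the barrier closing epoch N is acked */
    for (;;) {
            run_app_for_interval(T);    /* writes go to DRBD, protocol A */
            send_barrier();             /* causes a P_BARRIER on the wire */
            wait_for_barrier_ack();     /* both disks now have epoch N */
            release_buffered_network_output();
    }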

Let's ignore the app side of fault tolerance and the network latency
aspect for the moment (and the argument that protocol C is almost as
good as protocol A). I chose async replication because this app
demands it.

Failure:
On primary failure, the secondary "discards" all writes buffered in
the current epoch, activates the secondary disk and spawns the app.
The app at the secondary will always start from the last consistent
sync point. On secondary failure, the primary just moves on.

IOW, always sync from current primary to the node that comes back online.

On resync (assuming primary failure), the secondary copies all data
written since failover (using the quick-sync bitmap and activity log)
to the primary. Any disk regions touched by the primary before the
crash have to be overwritten with a copy from the secondary.

What I am concerned about is that the sudden influx of IOs at every
P_BARRIER at the secondary might choke the kernel and cause
drbd_submit_ee to fail. If this failure happens rarely, that's OK, but
if it occurs regularly, then I would have to control the rate at which
the deferred IOs are flushed to disk.

If the application were to do, say, 2000 writes (4k-sized writes, so
8 MB) within 50 ms under peak load, will the flush at the secondary
"fail", or will it just be slower than normal (because
generic_make_request in drbd_submit_ee simply blocks)?
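
(Back of the envelope: 8 MB every 50 ms is 160 MB/s sustained, which
is likely more than the secondary's disk can drain, so under sustained
peak load the deferred queue would keep growing even if submission
itself never fails.)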

> > and I am a bit of a newbie to the concept of "bio"s.
> >
> > My question (all concerning IO at the secondary, for protocol A/B):
> > In drbd_receiver.c, esp. in the function receive_Data(..),
> > the backup disconnects from the primary when the drbd_submit_ee(..)
> > call fails. The comments indicate:
> >         /* drbd_submit_ee currently fails for one reason only:
> >          * not being able to allocate enough bios.
> >          * Is dropping the connection going to help? */
> >
> > So, the code just finishes the activity log IO, releases the ee and
> > returns false, which causes the main loop to disconnect from the
> > primary.
> >
> > Why was this choice made?
>
> It grew that way.
>
> It does not happen, I am not aware of that code path having ever been
> taken. If it should be taken some day, we'll likely fix it so it won't
> happen again.

Now that's a relief.


> But as long as drbd_submit_ee can fail (in theory), there needs to be
> an error handling branch (that is at least "correct").
>
> Disconnecting/reconnecting was the easiest possible error path here.
>
> But see my comment about biosets below.
>
> > Please correct me if I am wrong:
> > Isn't failure to allocate a bio a temporary issue? I mean, the
> > kernel ran out of bios to allocate out of its slabs (or is short of
> > memory at the moment), and thus retrying again after a while might
> > work.
> >
> > I understand that for protocol C, one cannot buffer the IO on the
> > secondary. But for protocol A/B, the IOs can certainly be buffered
> > and retried. Isn't that better than just disconnecting from the
> > primary and causing reconnects?
> > ==========
>
>
> > On the same note,
> > the "w_e_reissue" callback is used to resubmit a failed IO, if the
> > IO had the REQ_HARDBARRIER flag.
>
> Which is obsolete, btw, and will go away.  Recent kernels have REQ_FUA |
> REQ_FLUSH, which will not fail but for real IO error.
>
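
(For my own notes: on 2.6.36+ the replacement would look roughly like
this, if I read the block layer changes correctly:)

    /* old (<= 2.6.35): ordered barrier write, may fail -EOPNOTSUPP */
    bio->bi_rw |= REQ_HARDBARRIER;

    /* new (2.6.36+): flush the device cache before this write
     * (REQ_FLUSH) and force the write itself to stable storage
     * (REQ_FUA); fails only on real IO errors */
    bio->bi_rw |= REQ_FLUSH | REQ_FUA;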
> > Looking at this function, it tries to reissue the IO and
> >  (a) when drbd_submit_ee fails,
> >     it installs itself as the callback handler and re-queues the
> > work. This contradicts the receive_Data(..)
>
> So what.
> It does not happen anyways, it just needed to be "correct".
> And, in this case we know that there just now had been enough bios for
> this ee, we just gave them back to the system, it is highly likely that
> we get enough bios back again.
>

This is kind of what I plan to do if drbd_submit_ee starts failing
when flushing the deferred writes: try to queue up one request for
every bio completion.
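
Something like this, reusing the deferred list from my sketch above
(flush_deferred_work is hypothetical; drbd_queue_work as in 8.3):

    /* from the secondary's write completion path: a completed write
     * means at least one bio was just freed, so if an earlier flush
     * ran out of bios, kick the worker to retry the rest; the
     * unlocked list_empty() is only a hint, no lock needed */
    static void kick_deferred_flush(struct drbd_conf *mdev)
    {
            if (!list_empty(&mdev->deferred_ee))
                    drbd_queue_work(&mdev->data.work,
                                    &mdev->flush_deferred_work);
    }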

> > error handling, where a drbd_submit_ee call failure leads to
> > connection termination.
> >
> >    Also, this could cause potential looping (probably infinite) when the
> > submit_ee call keeps failing due to ENOMEM.
>
> Didn't you yourself suggest retrying, stating that not being able to
> allocate bios was only a temporary problem? ;-)
>
> >    Shouldn't there be some sort of "num_attempts" counter that
> > limits the number of IO retries?
>
> No.
>
> There should probably be a dedicated drbd bioset where we allocate bios
> from.
> It has not been an issue, so we did not implement it yet.
> If you want to do that, it would be quite easy: change drbd's
> internal bio_alloc to bio_alloc_bioset from that drbd bioset, make
> drbd_submit_ee void (it won't fail), etc.
> Send patch to this list, or PM, whatever you prefer.
>
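
That sounds doable; I may give it a shot. Presumably something along
these lines (a minimal sketch: pool size picked arbitrarily,
module-exit cleanup elided):

    static struct bio_set *drbd_bioset;

    /* at module init: reserve a private pool of bios for drbd */
    int drbd_create_bioset(void)
    {
            drbd_bioset = bioset_create(128 /* arbitrary */, 0);
            return drbd_bioset ? 0 : -ENOMEM;
    }

    /* would replace bio_alloc() inside drbd_submit_ee(); as I
     * understand the mempool semantics, with GFP_NOIO this waits
     * for a bio to be returned rather than failing, which is what
     * would let drbd_submit_ee become void */
    static struct bio *drbd_bio_alloc(gfp_t gfp_mask, int nr_iovecs)
    {
            return bio_alloc_bioset(gfp_mask, nr_iovecs, drbd_bioset);
    }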
> > The comments in this function say
> > "@cancel  The connection will be closed anyways (unused in this
> > callback)".
> > I cannot find a control path that causes a connection close before
> > reaching this function.
>
> Those are asynchronous, and may happen any time, whenever drbd detects
> that we have a problem with our replication link.
> The worker then calls all callbacks with this "cancel" flag set, so they
> know early on that there is no point in trying to send anything over (in
> case they wanted to).
>
> > On the other hand,
> >  drbd_endio_sec --> drbd_endio_sec_final
>
> which both are void...
>
> >    where this ee is simply requeued, with its callback changed to
> > w_e_reissue which always returns 1.
>
> Yes. We don't want the worker to die.
>
> Again, this "grew that way".
> I think in the long run, the cancel parameter to our work callbacks
> may be dropped, and they may become void. But that's not particularly
> urgent.
>
> >    (unlike e_end_block, which returns 0, causing the worker thread
> > to force the connection to go down)
>
> No, that causes the _asender_ thread (not the worker) to _notice_
> that the connection was lost (it has not been able to send an ACK).
> But this again is probably not necessary anymore as we already called
> into the state handling from where the send actually failed,
> and possibly could become void.
>
> hth,
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



-- 
perception is but an offspring of its own self

