<br><div class="gmail_quote">On Wed, Nov 17, 2010 at 12:27 PM, Lars Ellenberg <span dir="ltr"><<a href="mailto:lars.ellenberg@linbit.com">lars.ellenberg@linbit.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im">On Wed, Nov 17, 2010 at 11:00:03AM -0800, Shriram Rajagopalan wrote:<br>
> [I apologize if this is a double post]<br>
<br>
</div>It's not.<br>
drbd-dev just happens to be moderated somewhat ;-)<br>
<div class="im"><br>
> Hi all,<br>
> I have recently started hacking into drbd kernel code<br>
<br>
</div>Just curious: Why?<br>
What are you up to?<br>
<div class="im"><br></div></blockquote><div>Well there is this 4th protocol I want to try.<br>Protocol A + defer writes @ backup until P_BARRIER <br>(ie buffer the writes in current epoch until an explicit P_BARRIER).<br>
<br> currently receive_Barrier only blocks on active_ee list, waiting<br>for completion of pending IOs. <br> I want to issue the deferred writes in receive_Barrier and then<br>fall through the rest of the code.<br><br>Consider an App that is moderately io intensive and it needs transactional <br>
view of disk to survive failures. It "buffers" any network output until the next P_BARRIER.<br>It issues a P_BARRIER every T "milliseconds" where T = 50ms/100ms..whatever. <br>Since barrier ack implies that both disks are in sync, App @ primary will send network output <br>
only after receiving the barrier ACK.<br><br>Lets ignore the App side of fault tolerance and the network latency aspect for the moment.<br>(and the argument that protocol C is almost as good as proto A). I chose async replication<br>
because this app demands so.<br><br>Failure:<br>On primary failure, secondary "discards" all writes buffered in the current epoch <br>& activates secondary disk and spawns App. App @ secondary will always start with<br>
last consistent sync point.<br>On secondary failure, primary just moves on.<br><br>IOW, always sync from current primary to the node that comes back online.<br><br>On resync (assuming primary failure),<br> secondary copies all data (using quick-sync bitmap and activity log) since failover<br>
to primary. Any disk regions touched by by primary before crash<br>have to be overwritten with a copy from the secondary.<br> <br>What I am concerned about is that the sudden influx of IOs at every P_BARRIER, @ secondary<br>
might choke the kernel and cause drbd_submit_ee to fail. If this failure happens rarely, its ok but<br>if it occurs regularly, then I would have to control the rate at which the deferred IOs are flushed<br>to disk.<br><br>
If the application were to do say 2000 writes (4k sized writes, so 8MB) within 50ms under peak load,<br>will the flush at secondary "fail" or will it be just slower than normal <br>(because do_generic_request in drbd_submit_ee just blocks) ?<br>
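
Roughly what I have in mind for the receive path, as a sketch only and not
real DRBD code (struct deferred_write, defer_write(), flush_deferred_writes(),
discard_epoch() and the two stub helpers are made-up names standing in for
the real epoch-entry machinery):

#include <linux/list.h>
#include <linux/spinlock.h>

/* hypothetical container for one buffered write of the current epoch */
struct deferred_write {
	struct list_head list;
	/* pointer to the received data / epoch entry would go here */
};

static LIST_HEAD(deferred_writes);	/* current epoch, not yet submitted */
static DEFINE_SPINLOCK(deferred_lock);	/* receiver context only */

/* placeholder: this is where drbd_submit_ee() would actually be called */
static void submit_deferred(struct deferred_write *dw)
{
}

/* placeholder: free the buffered data without ever writing it */
static void discard_deferred(struct deferred_write *dw)
{
}

/* receive_Data() path (protocol A): queue instead of submitting right away */
static void defer_write(struct deferred_write *dw)
{
	spin_lock(&deferred_lock);
	list_add_tail(&dw->list, &deferred_writes);
	spin_unlock(&deferred_lock);
}

/* receive_Barrier() path: flush the whole epoch, then fall through to the
 * existing wait-for-completion / barrier-ack code */
static void flush_deferred_writes(void)
{
	struct deferred_write *dw, *tmp;
	LIST_HEAD(work);

	spin_lock(&deferred_lock);
	list_splice_init(&deferred_writes, &work);
	spin_unlock(&deferred_lock);

	list_for_each_entry_safe(dw, tmp, &work, list) {
		list_del(&dw->list);
		submit_deferred(dw);
	}
}

/* primary failure: drop the incomplete epoch instead of flushing it */
static void discard_epoch(void)
{
	struct deferred_write *dw, *tmp;

	spin_lock(&deferred_lock);
	list_for_each_entry_safe(dw, tmp, &deferred_writes, list) {
		list_del(&dw->list);
		discard_deferred(dw);
	}
	spin_unlock(&deferred_lock);
}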
<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="im">> and I am bit of a newbie to the concept of "bio"s.<br>
><br>
> my question: (all concerning IO at secondary, for Protocol A/B)<br>
> In drbd_receiver.c, esp in function receive_Data(..),<br>
> the backup disconnects from primary when drbd_submit_ee(..) call fails.<br>
> The comments indicate<br>
> /* drbd_submit_ee currently fails for one reason only:<br>
> * not being able to allocate enough bios.<br>
> * Is dropping the connection going to help? */<br>
><br>
> So, the code just finishes the activity log io, releases the ee and returns<br>
> false,<br>
> which causes the main loop to disconnect from primary.<br>
><br>
> Why was this choice made?<br>
>
> It grew that way.
>
> It does not happen; I am not aware of that code path ever having been
> taken. If it should be taken some day, we'll likely fix it so it won't
> happen again.

Now that's a relief.

> But as long as drbd_submit_ee can fail (in theory), there needs to be an
> error handling branch (that is at least "correct").
>
> Disconnecting/reconnecting was the easiest possible error path here.
>
> But see my comment about biosets below.
<div class="im"><br>
> Please correct me if I am wrong:<br>
> Isnt failure to allocate a bio a temporary issue? I mean the kernel ran out<br>
> of bio's to allocate out of its slabs (or short of memory currently)<br>
> and thus retrying again after a while might work.<br>
><br>
> I understand that for protocol C, one cannot buffer the IO on<br>
> secondary. But for Protocol A/B, they can certainly be buffered and<br>
> retried. Isnt that better than just disconnecting from primary and<br>
> causing reconnects?<br>
> ==========<br>
>
> > On the same note,
> > the "w_e_reissue" callback is used to resubmit a failed IO, if the IO
> > had the REQ_HARDBARRIER flag.
>
> Which is obsolete, btw, and will go away. Recent kernels have REQ_FUA |
> REQ_FLUSH, which will not fail except on a real IO error.
>
> > Looking at this function, it tries to reissue the IO and,
> > (a) when drbd_submit_ee fails,
> > it installs itself as the callback handler and re-queues the work. This
> > contradicts the receive_Data(..)
>
> So what.
> It does not happen anyway; it just needed to be "correct".
> And in this case we know that there had just been enough bios for this ee,
> we just gave them back to the system, so it is highly likely that we get
> enough bios back again.
<div class="im"><br></div></blockquote><div>this is kind of what I plan to do, if drbd_submit_ee starts failing when flushing<br>the deferred writes. Try to queue up one request for every bio completion. <br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im">
> error handling, where drbd_submit_ee call failure leads to connection<br>
> termination.<br>
><br>
> Also, this could cause potential looping (probably infinite) when the<br>
> submit_ee call keeps failing due to ENOMEM.<br>
<br>
> Didn't you yourself suggest retrying, stating that not being able to
> allocate bios was only a temporary problem? ;-)
>
> > Shouldn't there be some sort of "num_attempts" counter that limits the
> > number of IO retries?
>
> No.
>
> There should probably be a dedicated drbd bioset to allocate bios from.
> It has not been an issue, so we have not implemented it yet.
> If you want to do that, it would be quite easy:
> change the drbd-internal bio_alloc to bio_alloc_bioset from that drbd
> bioset, which could make drbd_submit_ee void (it won't fail), etc.
> Send a patch to this list, or a PM, whatever you prefer.
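
For reference, the bioset idea as I understand it, sketched with made-up
names (drbd_ee_bio_set, drbd_ee_bio_alloc); only bioset_create(),
bio_alloc_bioset() and bioset_free() are the actual kernel API:

#include <linux/bio.h>
#include <linux/errno.h>

static struct bio_set *drbd_ee_bio_set;

static int drbd_create_bioset(void)
{
	/* mempool of 128 bios, no front padding; the number is a guess */
	drbd_ee_bio_set = bioset_create(128, 0);
	return drbd_ee_bio_set ? 0 : -ENOMEM;
}

static void drbd_destroy_bioset(void)
{
	if (drbd_ee_bio_set)
		bioset_free(drbd_ee_bio_set);
}

/* would replace bio_alloc() in the ee submit path */
static struct bio *drbd_ee_bio_alloc(gfp_t gfp_mask, int nr_iovecs)
{
	return bio_alloc_bioset(gfp_mask, nr_iovecs, drbd_ee_bio_set);
}

With a GFP_NOIO allocation the mempool can block but, as far as I understand,
does not return NULL, which is what would let drbd_submit_ee become void.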
<div class="im"><br>
> the comments in this function<br>
> "@cancel The connection will be closed anyways (unused in this callback)"<br>
> I cannot find a control path that causes a connection close, before reaching<br>
> this function.<br>
<br>
> Those are asynchronous and may happen at any time, whenever drbd detects
> that we have a problem with our replication link.
> The worker then calls all callbacks with this "cancel" flag set, so they
> know early on that there is no point in trying to send anything over (in
> case they wanted to).
>
> > On the other hand,
> > drbd_endio_sec --> drbd_endio_sec_final
>
> which are both void...
>
> > where this ee is simply requeued, with its callback changed to
> > w_e_reissue, which always returns 1.
>
> Yes. We don't want the worker to die.
>
Again, this "grew that way".<br>
I think in the long run, the cancel parameter to our work callbacks<br>
may be dropped, and they may become void. But that's not particularly<br>
urgent.<br>
<div class="im"><br>
> (unlike e_end_block which returns 0 causing the worker thread to force<br>
> connection to go down)<br>
<br>
> No, that causes the _asender_ thread (not the worker) to _notice_
> that the connection was lost (it has not been able to send an ACK).
> But this again is probably not necessary anymore, as we already called
> into the state handling from where the send actually failed,
> and it could possibly become void as well.
>
> hth,
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
</blockquote></div><br><br clear="all"><br>-- <br>perception is but an offspring of its own self<br>