<br><div class="gmail_quote">On Wed, Nov 17, 2010 at 12:27 PM, Lars Ellenberg <span dir="ltr">&lt;<a href="mailto:lars.ellenberg@linbit.com">lars.ellenberg@linbit.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im">On Wed, Nov 17, 2010 at 11:00:03AM -0800, Shriram Rajagopalan wrote:<br>

&gt; [I apologize if this is a double post]<br>

<br>

</div>It&#39;s not.<br>

drbd-dev just happens to be moderated somewhat ;-)<br>

<div class="im"><br>

&gt; Hi all,<br>

&gt; I have recently started hacking into drbd kernel code<br>

<br>

</div>Just curious: Why?<br>

What are you up to?<br>

<div class="im"><br></div></blockquote><div>Well, there is a fourth protocol I want to try:<br>Protocol A plus deferring writes at the backup until P_BARRIER<br>(i.e. buffer the writes of the current epoch until an explicit P_BARRIER arrives).<br>
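<br>To make the idea concrete, here is a rough userspace model in Python. It is not kernel code, only a sketch of the intended behaviour. The names EpochBuffer, receive_data, receive_barrier and submit are made up for illustration and do not exist in drbd; submit merely stands in for whatever would eventually call drbd_submit_ee.<br>
<pre>
```python
class EpochBuffer:
    """Toy model of the secondary: hold writes until the epoch's barrier."""

    def __init__(self, submit):
        self.submit = submit    # hypothetical stand-in for the disk submit path
        self.deferred = []      # writes of the currently open epoch
        self.epoch = 0          # number of barriers seen so far

    def receive_data(self, sector, payload):
        # Under protocol A the primary does not wait for this write,
        # so we only remember it instead of submitting it right away.
        self.deferred.append((sector, payload))

    def receive_barrier(self):
        # Issue everything buffered in this epoch, in arrival order,
        # then open the next epoch. Returns the number of writes issued.
        flushed = len(self.deferred)
        for sector, payload in self.deferred:
            self.submit(sector, payload)
        self.deferred = []
        self.epoch += 1
        return flushed


disk = {}
buf = EpochBuffer(lambda sector, payload: disk.__setitem__(sector, payload))
buf.receive_data(8, b"a")
buf.receive_data(16, b"b")
print(len(disk))              # writes that reached the disk before the barrier
print(buf.receive_barrier())  # writes issued at the barrier
print(sorted(disk))           # sectors on disk after the barrier
```
</pre>
In this model nothing touches the disk between barriers, which is the property the rest of this mail relies on.<br>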

<br>Currently, receive_Barrier only blocks on the active_ee list, waiting<br>for pending IOs to complete.<br>I want to issue the deferred writes in receive_Barrier and then<br>fall through to the rest of the code.<br><br>Consider an app that is moderately IO intensive and needs a transactional<br>

view of the disk to survive failures. It &quot;buffers&quot; all network output until the next P_BARRIER.<br>It issues a P_BARRIER every T milliseconds, where T is 50 ms, 100 ms, or whatever.<br>Since a barrier ACK implies that both disks are in sync, the app on the primary sends network output<br>

only after receiving the barrier ACK.<br><br>Let&#39;s ignore the app side of fault tolerance and the network latency aspect for the moment<br>(and the argument that protocol C is almost as good as protocol A). I chose async replication<br>

because this app demands it.<br><br>Failure:<br>On primary failure, the secondary &quot;discards&quot; all writes buffered in the current epoch,<br>activates the secondary disk, and spawns the app. The app on the secondary always starts from<br>

the last consistent sync point.<br>On secondary failure, the primary just moves on.<br><br>IOW, always sync from the current primary to the node that comes back online.<br><br>On resync (assuming primary failure),<br>the secondary copies all data written since failover (using the quick-sync bitmap and activity log)<br>

to the primary. Any disk regions touched by the primary before the crash<br>have to be overwritten with a copy from the secondary.<br><br>What I am concerned about is that the sudden influx of IOs at every P_BARRIER on the secondary<br>

might choke the kernel and cause drbd_submit_ee to fail. If this failure happens rarely, it&#39;s OK, but<br>if it occurs regularly, I would have to control the rate at which the deferred IOs are flushed<br>to disk.<br><br>
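One way to picture such rate control is a bounded window of outstanding writes, refilled by one submission per completion. The following is only a userspace Python model under that assumption; ThrottledFlusher, max_in_flight and on_completion are hypothetical names, the window size of 64 is arbitrary, and completions are modelled as strictly in order, which a real disk does not guarantee.<br>
<pre>
```python
from collections import deque


class ThrottledFlusher:
    """Toy model: keep at most max_in_flight deferred writes outstanding."""

    def __init__(self, writes, max_in_flight):
        self.pending = deque(writes)   # deferred writes not yet submitted
        self.submitted = deque()       # writes currently "on the disk queue"
        self.completed = []            # writes that have finished
        self.max_in_flight = max_in_flight
        self.peak = 0                  # largest window actually observed

    def _submit_one(self):
        self.submitted.append(self.pending.popleft())
        self.peak = max(self.peak, len(self.submitted))

    def start(self):
        # Called when the barrier arrives: fill the window once.
        while self.pending and self.max_in_flight > len(self.submitted):
            self._submit_one()

    def on_completion(self):
        # Called once per finished write: retire it and, if anything is
        # still deferred, replace it with exactly one new submission.
        self.completed.append(self.submitted.popleft())
        if self.pending:
            self._submit_one()


flusher = ThrottledFlusher(range(2000), max_in_flight=64)
flusher.start()
while flusher.submitted:
    flusher.on_completion()
print(flusher.peak, len(flusher.completed))
```
</pre>
The point of the model is only that the number of writes in flight never exceeds the window, however many writes the epoch contained.<br><br>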

If the application were to do, say, 2000 writes (4k each, so about 8MB) within 50ms under peak load,<br>will the flush on the secondary &quot;fail&quot;, or will it just be slower than normal<br>(because do_generic_request in drbd_submit_ee just blocks)?<br>
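<br>For reference, the arithmetic behind those numbers, as a small Python calculation. It assumes that 4k means 4 KiB and that the peak persists across back-to-back epochs, which is a worst case rather than a measurement.<br>
<pre>
```python
writes_per_epoch = 2000
write_size = 4 * 1024            # bytes, assuming "4k" means 4 KiB
epoch_ms = 50                    # one P_BARRIER every 50 ms

bytes_per_epoch = writes_per_epoch * write_size
mib_per_epoch = bytes_per_epoch / (1024 * 1024)
epochs_per_second = 1000 / epoch_ms
mib_per_second = mib_per_epoch * epochs_per_second

print(bytes_per_epoch, mib_per_epoch, mib_per_second)
```
</pre>
That is about 7.8 MiB per epoch, and roughly 156 MiB/s if the secondary has to finish each burst before the next barrier arrives.<br>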

<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="im">&gt; and I am bit of a newbie to the concept of &quot;bio&quot;s.<br>


&gt;<br>

&gt; my question: (all concerning IO at secondary, for Protocol A/B)<br>

&gt;  In drbd_receiver.c, esp in function receive_Data(..),<br>

&gt; the backup disconnects from primary when drbd_submit_ee(..) call fails.<br>

&gt; The comments indicate<br>

&gt;         /* drbd_submit_ee currently fails for one reason only:<br>

&gt;          * not being able to allocate enough bios.<br>

&gt;          * Is dropping the connection going to help? */<br>

&gt;<br>

&gt; So, the code just finishes the activity log io, releases the ee and returns<br>

&gt; false,<br>

&gt; which causes the main loop to disconnect from primary.<br>

&gt;<br>

&gt; Why was this choice made?<br>

<br>

</div>It grew that way.<br>

<br>

It does not happen, I am not aware of that code path having ever been<br>

taken. If it should be taken some day, we&#39;ll likely fix it so it won&#39;t<br>

happen again.<br>

<br></blockquote><div>Now that&#39;s a relief.<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

But as long as drbd_submit_ee can fail (in theory) there needs to be an<br>

error handling branch (that is at least &quot;correct&quot;).<br>

<br>

Disconnecting/reconnecting was the easiest possible error path here.<br>

<br>

But see my comment about biosets below.<br>

<div class="im"><br>

&gt; Please correct me if I am wrong:<br>

&gt; Isnt failure to allocate a bio a temporary issue? I mean the kernel ran out<br>

&gt; of bio&#39;s to allocate out of its slabs (or short of memory currently)<br>

&gt; and thus retrying again after a while might work.<br>

&gt;<br>

&gt; I understand that for protocol C, one cannot buffer the IO on<br>

&gt; secondary. But for Protocol A/B, they can certainly be buffered and<br>

&gt; retried. Isnt that better than just disconnecting from primary and<br>

&gt; causing reconnects?<br>

&gt; ==========<br>

<br>

<br>

&gt; On the same note,<br>

&gt; function &quot;w_e_reissue&quot; callback is used to resubmit a failed IO , if the IO<br>

&gt; had REQ_HARDBARRIER flag.<br>

<br>

</div>Which is obsolete, btw, and will go away.  Recent kernels have REQ_FUA |<br>

REQ_FLUSH, which will not fail but for real IO error.<br>

<div class="im"><br>

&gt; Looking at this function, it tries to reissue the IO and<br>

&gt;  (a) when drbd_submit_ee fails,<br>

&gt;     it installs itself as the callback handler and re queues the work. This<br>

&gt; contradicts with the receive_Data(..)<br>

<br>

</div>So what.<br>

It does not happen anyways, it just needed to be &quot;correct&quot;.<br>

And, in this case we know that there just now had been enough bios for<br>

this ee, we just gave them back to the system, it is highly likely that<br>

we get enough bios back again.<br>

<div class="im"><br></div></blockquote><div>This is roughly what I plan to do if drbd_submit_ee starts failing while flushing<br>the deferred writes: queue up one request for every bio completion.<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im">

&gt; error handling, where drbd_submit_ee call failure leads to connection<br>

&gt; termination.<br>

&gt;<br>

&gt;    Also, this could cause potential looping (probably infinite) when the<br>

&gt; submit_ee call keeps failing due to ENOMEM.<br>

<br>

</div>Did not you yourself suggest retrying, stating that not being able to<br>

allocate bios was only a temporary problem? ;-)<br>

<div class="im"><br>

&gt;    shouldnt there be some sort of &quot;num_attempts&quot; counter that limits number<br>

&gt; of IO retries?<br>

<br>

</div>No.<br>

<br>

There should probably be a dedicated drbd bioset where we allocate bios from.<br>

It has not been an issue, so we did not implement it yet.<br>

If you want to do that, it would be quite easy:<br>

change drbd&#39;s internal bio_alloc to bio_alloc_bioset from that drbd<br>

bioset, then drbd_submit_ee could become void (won&#39;t fail), etc.<br>

Send patch to this list, or PM, whatever you prefer.<br>

<div class="im"><br>

&gt; the comments in this function<br>

&gt; &quot;@cancel  The connection will be closed anyways (unused in this callback)&quot;<br>

&gt; I cannot find a control path that causes a connection close, before reaching<br>

&gt; this function.<br>

<br>

</div>Those are asynchronous, and may happen any time, whenever drbd detects<br>

that we have a problem with our replication link.<br>

The worker then calls all callbacks with this &quot;cancel&quot; flag set, so they<br>

know early on that there is no point in trying to send anything over (in<br>

case they wanted to).<br>

<div class="im"><br>

&gt; On the other hand,<br>

&gt;  drbd_endio_sec --&gt; drbd_endio_sec_final<br>

<br>

</div>which both are void...<br>

<div class="im"><br>

&gt;    where this ee is simply requeued, with its callback changed to<br>

&gt; w_e_reissue which always returns 1.<br>

<br>

</div>Yes. We don&#39;t want the worker to die.<br>

<br>

Again, this &quot;grew that way&quot;.<br>

I think in the long run, the cancel parameter to our work callbacks<br>

may be dropped, and they may become void. But that&#39;s not particularly<br>

urgent.<br>

<div class="im"><br>

&gt;    (unlike e_end_block which returns 0 causing the worker thread to force<br>

&gt; connection to go down)<br>

<br>

</div>No, that causes the _asender_ thread (not the worker) to _notice_<br>

that the connection was lost (it has not been able to send an ACK).<br>

But this again is probably not necessary anymore as we already called<br>

into the state handling from where the send actually failed,<br>

and possibly could become void.<br>

<br>

hth,<br>

<br>

--<br>

: Lars Ellenberg<br>

: LINBIT | Your Way to High Availability<br>

: DRBD/HA support and consulting <a href="http://www.linbit.com" target="_blank">http://www.linbit.com</a><br>

<br>

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.<br>


</blockquote></div><br><br clear="all"><br>-- <br>perception is but an offspring of its own self<br>