On Thu, Apr 5, 2012 at 11:53 AM, Florian Haas <span dir="ltr">&lt;<a href="mailto:florian@hastexo.com">florian@hastexo.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On Thu, Apr 5, 2012 at 8:34 PM, Brian Chrisman &lt;<a href="mailto:brchrisman@gmail.com">brchrisman@gmail.com</a>&gt; wrote:<br>

&gt; I have a shared/parallel filesystem on top of drbd dual primary/protocol C<br>

&gt; (using 8.3.11 right now).<br>

<br>

</div>_Which_ filesystem precisely?<br></blockquote><div><br></div><div>I&#39;m testing this with GPFS.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div class="im"><br>

&gt; My question is about recovering after a network outage where I have a<br>

&gt; &#39;resource-and-stonith&#39; fence handler which panics both systems as soon as<br>

&gt; possible.<br>

<br>

</div>Self-fencing is _not_ how a resource-and-stonith fencing handler is<br>

meant to operate.<br></blockquote><div><br></div><div>I&#39;m not concerned about a basic disconnect, where I can use a tie-breaker setup (I do have a fencing setup which looks like it handles that just fine, i.e., by selecting the &#39;working&#39; node, as defined by cluster membership, to continue/resume IO). I&#39;m talking about something more apocalyptic, where neither node can contact a tie breaker. In that case I don&#39;t care about having a node continue operations; I just want to make sure there&#39;s no data corruption.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im"><br>

&gt; Even with Protocol-C, can the bitmaps still have dirty bits set? (ie,<br>

&gt; different writes on each local device which haven&#39;t returned/acknowledged to<br>

&gt; the shared filesystem because they haven&#39;t yet been written remotely?)<br>

<br>

</div>The bitmaps only apply to background synchronization. Foreground<br>

replication does not use the quick-sync bitmap.<br></blockquote><div><br></div><div>I was reading in the documentation that when a disconnect event occurs, there&#39;s a UUID shuffle where &#39;current&#39; -&gt; &#39;bitmap&#39; -&gt; historic... and &#39;new&#39; becomes &#39;current&#39;. Is that the scheme you mean, the one that&#39;s only applicable to background sync?</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im"><br>

&gt; Maybe a more concrete example will make my question clearer:<br>

&gt; - node A &amp; B (2 node cluster) are operating nominally in primary/primary<br>

&gt; mode (shared filesystem provides locking and prevents simultaneous write<br>

&gt; access to the same blocks on the shared disk).<br>

&gt; - node A: write to drbd device, block 234567, written locally, but remote<br>

&gt; copy does not complete due to network failure<br>

&gt; - node B: write to drbd device, block 876543, written locally, but remote<br>

&gt; copy does not complete due to network failure<br>

<br>

</div>Makes sense up to here.<br>

<div class="im"><br>

&gt; - Both writes do not complete and do not return successfully to the<br>

&gt; filesystem (protocolC).<br>

<br>

</div>You are aware that &quot;do not return successfully&quot; means that no<br>

completion is signaled, which is correct, but not that non-completion<br>

is signaled, which would be incorrect?<br></blockquote><div><br></div><div>Yeah, I suppose there is a whole host of issues here with regard to sync/async writes, but my expectation was that a synchronous call would hang.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im"><br>

&gt; - Fencing handler is invoked, where I can suspend-io and/or panic both nodes<br>

&gt; (since neither one is reliable at this point).<br>

<br>

</div>&quot;Panicking&quot; a node is pointless, and panicking both is even worse.<br>

What fencing is meant to do is use an alternate communications channel<br>

to remove the _other_ node, not the local one. And only one of them<br>

will win.<br></blockquote><div><br></div><div>I was expecting fencing to mean basically the same thing as in the old SAN sense of &#39;fencing off&#39; a path to a device, such that a surviving node can tell the SAN &quot;shut out that node that&#39;s screwed up/don&#39;t allow it to write&quot;. In the apocalyptic case, I was using (perhaps abusing) this as a callout for when the drbd network dies. But I suppose that (if I crashed the nodes) this would be the same scenario as a simultaneous power failure on both nodes.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im"><br>

&gt; If there is a chance of having unreplicated/unacknowledged writes on two<br>

&gt; different disks (those writes can&#39;t conflict, because the shared filesystem<br>

&gt; wont write to the same blocks on both nodes simultaneously), is there a<br>

&gt; resync option that will effectively &#39;revert&#39; any unreplicated/unacknowledged<br>

&gt; writes?<br>

<br>

</div>Yes, it&#39;s called the Activity Log, but you&#39;ve got this part wrong as<br>

you&#39;re under an apparent misconception as to what the fencing handler<br>

should be doing.<br></blockquote><div><br></div><div>My impression of the fencing handler, with the &#39;resource-and-stonith&#39; option selected, is:</div><div>When a write can&#39;t be completed to the remote disk, drbd immediately suspends all requests and calls the provided fencing handler. If the fence handler returns 7, drbd then continues on in standalone mode (well, that&#39;s what I&#39;ve been intending to use it for).</div>

<div><br></div><div>The fence handler can get, and does get, invoked on both nodes in primary/primary, though not necessarily at the same time. It seems that once either fs client/app issues a write to drbd and drbd can&#39;t contact its peer, drbd invokes the fencing handler (which is what I want).</div>
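<div><br></div><div>A minimal sketch of the configuration this describes, assuming drbd 8.3 syntax. The resource name r0 and the handler path are hypothetical placeholders, and the per-host device, disk and address settings are omitted:</div>
<pre>
resource r0 {
  protocol C;                      # a write completes only after the peer has it too

  net {
    allow-two-primaries;           # dual-primary, needed for the shared filesystem
  }

  disk {
    fencing resource-and-stonith;  # freeze IO and call fence-peer when the peer is lost
  }

  handlers {
    # Hypothetical script path. Exit code 7 tells drbd the peer was
    # fenced off the cluster (STONITH).
    fence-peer "/usr/local/sbin/fence-peer-example.sh";
  }
}
</pre>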

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im"><br>

&gt; I am considering writing a test for this and would like to know a bit more<br>

&gt; about what to expect before I do so.<br>

<br>

</div>Tell us what exactly you&#39;re trying to achieve please?<br></blockquote><div><br></div><div>My current state:</div><div>In my current setup, drbd in primary/primary handles a node being disconnected from the cluster just fine (with quorum indicating the surviving node). I&#39;ve been able to recover from that (treating the surviving node as &#39;good&#39; for continuity purposes). When the disconnected node reconnects, it has to become secondary and sync from the &#39;good&#39; node, discarding its own changes, etc.</div>

<div><br></div><div>My concern was whether an apocalyptic outage (where everybody loses quorum) can be recovered from. I hadn&#39;t read up on the activity log before, but that&#39;s indeed what I was looking for. If there&#39;s a primary/primary setup and the whole cluster loses power, will each peer in the drbd device roll back to a consistent point in the activity log?</div>

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class="HOEnZb"><font color="#888888"><br>

Florian<br>

<br>


</font></span></blockquote></div><br>