On Thu, Apr 5, 2012 at 11:53 AM, Florian Haas <florian@hastexo.com> wrote:
> On Thu, Apr 5, 2012 at 8:34 PM, Brian Chrisman <brchrisman@gmail.com> wrote:
>> I have a shared/parallel filesystem on top of drbd dual primary/protocol C
>> (using 8.3.11 right now).
>
> _Which_ filesystem precisely?

I'm testing this with GPFS.
<div class="im"><br>
> My question is about recovering after a network outage where I have a<br>
> 'resource-and-stonith' fence handler which panics both systems as soon as<br>
> possible.<br>
<br>
</div>Self-fencing is _not_ how a resource-and-stonith fencing handler is<br>
meant to operate.<br></blockquote><div><br></div><div>I'm not concerned about basic disconnect where I can use a tie breaker setup (I do have a fencing setup which looks like it handles that just fine, ie, selecting the 'working' node --defined by cluster membership-- to continue/resume IO). I'm talking about something more apocalyptic where both nodes can't contact a tie breaker. At this point I don't care about having a node continue operations, I just want to make sure there's no data corruption.</div>

>> Even with protocol C, can the bitmaps still have dirty bits set? (i.e.,
>> different writes on each local device which haven't returned/acknowledged to
>> the shared filesystem because they haven't yet been written remotely?)
>
> The bitmaps only apply to background synchronization. Foreground
> replication does not use the quick-sync bitmap.

I read in the documentation that when a disconnect event occurs there is a
UUID shuffle: 'current' becomes 'bitmap', 'bitmap' becomes 'historical', and
a newly generated UUID becomes 'current'. Is that the scheme we're discussing
that only applies to background sync?
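
For what it's worth, I've been inspecting the generation identifiers with
something like the following ('r0' is just a placeholder resource name):

  drbdadm get-gi r0    # compact GI tuple: current/bitmap/historical UUIDs plus flags
  drbdadm show-gi r0   # same information with a textual explanation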

>> Maybe a more concrete example will make my question clearer:
>> - node A & B (2 node cluster) are operating nominally in primary/primary
>> mode (shared filesystem provides locking and prevents simultaneous write
>> access to the same blocks on the shared disk).
>> - node A: write to drbd device, block 234567, written locally, but remote
>> copy does not complete due to network failure
>> - node B: write to drbd device, block 876543, written locally, but remote
>> copy does not complete due to network failure
>
> Makes sense up to here.
>
>> - Both writes do not complete and do not return successfully to the
>> filesystem (protocol C).
>
> You are aware that "do not return successfully" means that no
> completion is signaled, which is correct, but not that non-completion
> is signaled, which would be incorrect?

Yes -- I suppose there are a whole host of issues here with regard to
sync/async writes, but my expectation was that a synchronous call would
simply hang rather than fail.

>> - Fencing handler is invoked, where I can suspend-io and/or panic both nodes
>> (since neither one is reliable at this point).
>
> "Panicking" a node is pointless, and panicking both is even worse.
> What fencing is meant to do is use an alternate communications channel
> to remove the _other_ node, not the local one. And only one of them
> will win.

I was expecting fencing to mean basically the same thing as in the old SAN
sense of 'fencing off' a path to a device, so that a surviving node can tell
the SAN "shut out the node that's screwed up / don't allow it to write". In
the apocalyptic case I was using (perhaps abusing) this as a callout for when
the drbd network dies. But I suppose that crashing both nodes would leave me
in the same scenario as a simultaneous power failure on both nodes.

>> If there is a chance of having unreplicated/unacknowledged writes on two
>> different disks (those writes can't conflict, because the shared filesystem
>> won't write to the same blocks on both nodes simultaneously), is there a
>> resync option that will effectively 'revert' any unreplicated/unacknowledged
>> writes?
>
> Yes, it's called the Activity Log, but you've got this part wrong as
> you're under an apparent misconception as to what the fencing handler
> should be doing.

My impression of the fencing handler with the 'resource-and-stonith' option
selected is: when a write can't be completed to the remote disk, immediately
suspend all requests and call the configured fence-peer handler. If the
handler returns 7, continue on in standalone mode (well, that's what I've
been intending to use it for).

The fence handler can/does get invoked on both nodes in primary/primary,
though not necessarily on both at the same time. It seems that once either
filesystem client/application issues a write to drbd and drbd can't contact
its peer, it invokes the fencing handler (which is what I want).
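
To make that mental model concrete, the handler I have in mind looks roughly
like this (a sketch only, not my production script; have_quorum and
stonith_peer are placeholders for the real tie-breaker check and power-fencing
call):

  #!/bin/sh
  # Sketch of a fence-peer handler under resource-and-stonith.
  # DRBD has already suspended I/O on this resource before calling us.
  if have_quorum && stonith_peer; then
      # 7 = peer was fenced off the cluster; drbd resumes I/O and we run standalone
      exit 7
  fi
  # peer not confirmed dead (e.g. the apocalyptic no-quorum case);
  # don't claim success -- as I understand it, I/O then stays frozen
  exit 5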

>> I am considering writing a test for this and would like to know a bit more
>> about what to expect before I do so.
>
> Tell us what exactly you're trying to achieve please?

My current state: drbd in primary/primary handles a single node being
disconnected from the cluster just fine (with quorum indicating the surviving
node). I've been able to recover from that, treating the surviving node as
'good' for continuity purposes. When the disconnected node reconnects, it has
to become secondary and resync from the 'good' node, discarding its own
changes, etc.
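
For reference, the manual recovery I've been doing on reconnect is
essentially the split brain procedure from the DRBD documentation; roughly,
with 'r0' standing in for the real resource name:

  # on the reconnecting node, whose changes are to be discarded
  drbdadm secondary r0
  drbdadm -- --discard-my-data connect r0

  # on the surviving 'good' node, if it has already dropped to StandAlone
  drbdadm connect r0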

My concern was whether an apocalyptic outage (where everybody loses quorum)
can be recovered from. I hadn't read up on the activity log before, but that
is indeed what I was looking for. If there's a primary/primary setup and the
whole cluster loses power, will each peer in the drbd device roll back to a
consistent point via the activity log?
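
If it matters for the answer, my understanding (please correct me) is that
the data to be resynced after such a crash is bounded by the activity log
size, which in 8.3 is set in the syncer section; the value here is just an
example:

  syncer {
    al-extents 257;   # example value; each extent covers 4MiB of the backing device
  }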

> Florian
>
> --
> Need help with High Availability?
> http://www.hastexo.com/now