[DRBD-user] question on recovery from network failure on primary/primary

Thu Apr 5 22:07:29 CEST 2012

On Thu, Apr 5, 2012 at 11:53 AM, Florian Haas <florian at hastexo.com> wrote:

> On Thu, Apr 5, 2012 at 8:34 PM, Brian Chrisman <brchrisman at gmail.com>
> wrote:
> > I have a shared/parallel filesystem on top of drbd dual primary/protocol C
> > (using 8.3.11 right now).
>
> _Which_ filesystem precisely?
>

I'm testing this with GPFS.

>
> > My question is about recovering after a network outage where I have a
> > 'resource-and-stonith' fence handler which panics both systems as soon as
> > possible.
>
> Self-fencing is _not_ how a resource-and-stonith fencing handler is
> meant to operate.
>

I'm not concerned about basic disconnect where I can use a tie breaker
setup (I do have a fencing setup which looks like it handles that just
fine, ie, selecting the 'working' node --defined by cluster membership-- to
continue/resume IO).  I'm talking about something more apocalyptic where
both nodes can't contact a tie breaker.  At this point I don't care about
having a node continue operations, I just want to make sure there's no data
corruption.

>
> > Even with Protocol-C, can the bitmaps still have dirty bits set? (ie,
> > different writes on each local device which haven't returned/acknowledged
> > to the shared filesystem because they haven't yet been written remotely?)
>
> The bitmaps only apply to background synchronization. Foreground
> replication does not use the quick-sync bitmap.
>

I was reading in the documentation that when a disconnect event occurs,
there's a UUID shuffle where 'current' -> 'bitmap' -> historic... and
'new' becomes 'current'.  Is that the scheme we're discussing that's only
applicable to background sync?
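
For concreteness, here is a toy model of the shuffle as worded above.  It
is only a sketch of that description, not DRBD's metadata code: the class
and method names are hypothetical, and the number of historical slots is an
assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

HISTORY_SLOTS = 2  # assumption: how many historical UUIDs are kept


@dataclass
class GenerationIds:
    """Hypothetical stand-in for one node's set of generation UUIDs."""
    current: int
    bitmap: Optional[int] = None
    history: List[int] = field(default_factory=list)

    def start_new_generation(self, new_current: int) -> None:
        # An existing 'bitmap' UUID moves to the front of the history,
        # and the oldest entry falls off the end.
        if self.bitmap is not None:
            self.history = ([self.bitmap] + self.history)[:HISTORY_SLOTS]
        # 'current' -> 'bitmap', and 'new' becomes 'current'.
        self.bitmap = self.current
        self.current = new_current


ids = GenerationIds(current=0xA)
ids.start_new_generation(0xB)   # first disconnect
ids.start_new_generation(0xC)   # a second one before any resync
```

After the two calls, the model holds 0xC as current, 0xB as bitmap, and 0xA
as the only historical entry.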

>
> > Maybe a more concrete example will make my question clearer:
> > - node A & B (2 node cluster) are operating nominally in primary/primary
> > mode (shared filesystem provides locking and prevents simultaneous write
> > access to the same blocks on the shared disk).
> > - node A: write to drbd device, block 234567, written locally, but remote
> > copy does not complete due to network failure
> > - node B: write to drbd device, block 876543, written locally, but remote
> > copy does not complete due to network failure
>
> Makes sense up to here.
>
> > - Both writes do not complete and do not return successfully to the
> > filesystem (protocolC).
>
> You are aware that "do not return successfully" means that no
> completion is signaled, which is correct, but not that non-completion
> is signaled, which would be incorrect?
>

Yeah, I suppose there are a whole host of issues with this in regard to
sync/async writes, but my expectation was that a synchronous call would
hang.

>
> > - Fencing handler is invoked, where I can suspend-io and/or panic both
> > nodes (since neither one is reliable at this point).
>
> "Panicking" a node is pointless, and panicking both is even worse.
> What fencing is meant to do is use an alternate communications channel
> to remove the _other_ node, not the local one. And only one of them
> will win.
>

I was expecting fencing to basically mean the same thing as in the old SAN
sense of 'fencing off' a path to a device such that a surviving node can
tell the SAN "shut out the node that's screwed up/don't allow it to
write".  In the apocalyptic case, I was using (perhaps abusing) this as a
callout in the case where a drbd network dies.  But I suppose that this
would be the same scenario (if I crashed the nodes) as if there was a
simultaneous power failure to both nodes.

>
> > If there is a chance of having unreplicated/unacknowledged writes on two
> > different disks (those writes can't conflict, because the shared filesystem
> > won't write to the same blocks on both nodes simultaneously), is there a
> > resync option that will effectively 'revert' any unreplicated/unacknowledged
> > writes?
>
> Yes, it's called the Activity Log, but you've got this part wrong as
> you're under an apparent misconception as to what the fencing handler
> should be doing.
>

My impression of the fencing handler, with the 'resource-and-stonith'
option selected is:
When a write can't be completed to the remote disk, immediately suspend all
requests and call the provided fencing handler.  If the fence handler
returns 7, then continue on in standalone mode (well, that's what I've been
intending to use it for).

The fence handler can/does get invoked on both nodes in primary/primary,
though not necessarily on both at the same time.  It seems that once either
fs client/app issues a write to drbd, and drbd can't contact its peer, drbd
invokes the fencing handler (which is what I want).
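
To make that concrete, a minimal sketch of the decision such a handler
would make.  The two inputs are hypothetical: how quorum is checked and how
the peer is actually fenced are left out.  The only exit code taken from
this thread is 7; using 1 for "did not fence" is an assumption, on the
expectation that drbd then leaves IO suspended, which should be checked
against the drbd.conf man page for the version in use.

```python
PEER_STONITHED = 7   # "peer was fenced", the code discussed above
FENCING_FAILED = 1   # assumption: a code drbd does not treat as success


def decide_exit_code(have_quorum: bool, peer_fenced: bool) -> int:
    """Pick the fence-peer handler's exit code.

    have_quorum -- this node can still reach the tie breaker
                   (hypothetical check, supplied by the caller)
    peer_fenced -- an out-of-band fencing call confirmed the peer is off
                   (hypothetical check, supplied by the caller)
    """
    # Only claim success when this node is the designated survivor AND
    # the peer is known to be unable to write any more.
    if have_quorum and peer_fenced:
        return PEER_STONITHED
    # Apocalyptic case: no tie breaker, or the peer's state is unknown.
    # Report failure rather than letting this node resume IO.
    return FENCING_FAILED
```

A real handler script would end with something like
sys.exit(decide_exit_code(...)) once it has run its own checks.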

>
> > I am considering writing a test for this and would like to know a bit more
> > about what to expect before I do so.
>
> Tell us what exactly you're trying to achieve please?
>

My current state:
My setup is such that drbd in primary/primary handles a node being
disconnected from the cluster just fine (with a quorum indicating the
surviving node).  I've been able to recover from that (treating the
surviving node as 'good' for continuity purposes).  When the disconnected
node reconnects, it has to become secondary and sync to the 'good' node,
discarding its own changes, etc.

My concern was whether an apocalyptic outage (where everybody loses quorum)
can be recovered from.  I hadn't read up on the activity log before, but
that's indeed what I was looking for.  If there's a primary/primary setup
and the whole cluster loses power, will each peer in the drbd device then
roll back to a consistent point in the activity log?

>
> Florian
>
> --
> Need help with High Availability?
> http://www.hastexo.com/now
>