[DRBD-user] question on recovery from network failure on primary/primary

Thu Apr 5 20:34:31 CEST 2012

I have a shared/parallel filesystem on top of DRBD in dual-primary mode
with protocol C (using 8.3.11 right now).

My question is about recovering after a network outage where I have
'resource-and-stonith' fencing configured with a fence handler which
panics both systems as soon as possible.

Even with protocol C, can the bitmaps still have dirty bits set? (i.e.,
different writes on each local device which have not yet been acknowledged
to the shared filesystem because they have not yet been written remotely?)

Maybe a more concrete example will make my question clearer:
- Nodes A and B (a two-node cluster) are operating nominally in
primary/primary mode (the shared filesystem provides locking and prevents
simultaneous write access to the same blocks on the shared disk).
- Node A: write to the DRBD device, block 234567, is written locally, but
the remote copy does not complete due to network failure.
- Node B: write to the DRBD device, block 876543, is written locally, but
the remote copy does not complete due to network failure.
- Neither write completes, and neither returns successfully to the
filesystem (protocol C).
- The fencing handler is invoked, where I can suspend-io and/or panic both
nodes (since neither one is reliable at this point).
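
The scenario above can be sketched as a toy model in plain Python. This is
not DRBD code and makes no claim about how DRBD 8.3.11 tracks out-of-sync
blocks; the Node class, its dirty set, and the network_up flag are all
hypothetical names used only to make the question concrete.

```python
class Node:
    """Toy stand-in for one node; not DRBD's real data structures."""

    def __init__(self, name):
        self.name = name
        self.disk = {}       # block number -> data on the local backing device
        self.dirty = set()   # blocks written locally but not confirmed remotely
        self.peer = None

    def write(self, block, data, network_up):
        """Protocol-C-like write: acknowledged only once both copies exist."""
        self.disk[block] = data
        self.dirty.add(block)
        if not network_up:
            return False     # the write never returns success to the filesystem
        self.peer.disk[block] = data
        self.dirty.discard(block)
        return True


def run_scenario():
    a, b = Node("A"), Node("B")
    a.peer, b.peer = b, a
    # One replicated write while the link is still up.
    a.write(1, "old", network_up=True)
    # The link fails; each node has one local-only write in flight.
    acked_a = a.write(234567, "from-A", network_up=False)
    acked_b = b.write(876543, "from-B", network_up=False)
    return a, b, acked_a, acked_b


a, b, acked_a, acked_b = run_scenario()
print(acked_a, acked_b)                    # False False
print(sorted(a.dirty), sorted(b.dirty))    # [234567] [876543]
print(a.disk == b.disk)                    # False
```

In this model each node ends up with one block that differs from its peer
and was never acknowledged, which is exactly the state I am asking about.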

If there is a chance of having unreplicated/unacknowledged writes on two
different disks (those writes can't conflict, because the shared filesystem
won't write to the same blocks on both nodes simultaneously), is there a
resync option that will effectively 'revert' any
unreplicated/unacknowledged writes?
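
To illustrate what such a 'revert' might mean, here is a minimal sketch in
plain Python, assuming one node is picked as the resync source and every
block flagged on either side is overwritten with the source's copy. The
resync function and the dict/set representation are hypothetical; whether
DRBD offers a resync mode with this effect is the open question.

```python
def resync(source_disk, target_disk, source_dirty, target_dirty):
    """Overwrite every block flagged on either side with the source's copy."""
    for block in sorted(source_dirty | target_dirty):
        target_disk[block] = source_disk[block]
    source_dirty.clear()
    target_dirty.clear()


# Both disks start identical, then each node makes one local-only write.
disk_a = {234567: "old", 876543: "old"}
disk_b = dict(disk_a)
disk_a[234567] = "from-A"
dirty_a = {234567}
disk_b[876543] = "from-B"
dirty_b = {876543}

# Resync with node A as the source.
resync(disk_a, disk_b, dirty_a, dirty_b)
print(disk_b)             # {234567: 'from-A', 876543: 'old'}
print(disk_a == disk_b)   # True
```

Note that in this sketch only the target's unacknowledged write (block
876543) is reverted; the source's unacknowledged write (block 234567) is
propagated rather than undone.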

I am considering writing a test for this and would like to know a bit more
about what to expect before I do so.

Thanks,
Brian