[DRBD-user] Consistent device to primary fences remote node

Federico Simoncelli federico.simoncelli at gmail.com
Thu Nov 27 15:24:51 CET 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Thu, Nov 27, 2008 at 2:47 PM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
>> So basically the problem is that my outdate-peer doesn't try to
>> outdate the remote peer but it just fences it.
>> To fix this behaviour I could modify the outdate-peer handler to check
>> the DRBD_RESOURCE dstate and return the exit code 6 if local resource
>> is not "UpToDate".
>>
>> What do you think? Comments are welcome.
>
> if you return 6,
> drbd will (try to) outdate itself as a side effect.

That is exactly what I was trying to accomplish. Basically, my idea to
avoid split-brain is:

Scenario 1:

1) both servers: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate
2) server 2 is correctly shut down
3) server 1: cs:WFConnection st:Primary/Unknown ds:UpToDate/Outdated
4) booting server 2 in StandAlone mode is impossible since it has Outdated data

Scenario 2:

1) both servers: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate
2) server 2 is incorrectly shut down (fence/power loss), so the resource
remains in the "Consistent" state (not "UpToDate")
3) server 1: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown
4) booting server 2 in StandAlone mode is impossible since the
outdate-peer handler returns exit code 6 when the local resource is
not "UpToDate".
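
For reference, the cs:/st:/ds: values quoted in both scenarios can be
read from /proc/drbd or queried per resource with drbdadm (the resource
name "r0" below is just a placeholder):

  cat /proc/drbd         # one line per device with cs:... st:... ds:...
  drbdadm cstate r0      # connection state, e.g. WFConnection
  drbdadm state r0       # roles ("st:"), e.g. Primary/Unknown
  drbdadm dstate r0      # disk states ("ds:"), e.g. UpToDate/DUnknown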

Basically, a resource in the "Consistent" state will always be turned
into Outdated, because I have no way to check whether the remote peer
is primary. I assume that a peer left in the "Consistent" state was
incorrectly shut down and can't become primary without manual
intervention.
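
A minimal sketch of that handler check, assuming a bash handler; DRBD
passes the resource name in the DRBD_RESOURCE environment variable, and
the fencing command on the UpToDate path is a site-specific placeholder:

  #!/bin/bash
  # Sketch of the modified outdate-peer handler behaviour described above.

  # Local half of e.g. "Consistent/DUnknown", i.e. the local disk state.
  local_dstate=$(drbdadm dstate "$DRBD_RESOURCE" | cut -d/ -f1)

  if [ "$local_dstate" != "UpToDate" ]; then
      # Local data is only Consistent (or worse): don't fence the peer,
      # return 6 so DRBD outdates the local resource instead.
      exit 6
  fi

  # Local data is UpToDate: fence the remote node and hand its exit
  # status back to DRBD.
  exec /usr/local/sbin/fence-remote-node "$DRBD_RESOURCE"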

If both nodes are incorrectly shut down, they both end up in the
"Consistent" state. At the next boot they will both outdate their local
resource, and manual intervention is required to choose the more
up-to-date copy.

What do you think? Comments are welcome.

> why don't you just set a high initial wait for connection timeout?
>   wfc-timeout 172800;
> if within two days no-one came and told me that I'm outdated,
> and I still cannot reach the other node, I have all right to assume I'm
> the only survivor and allowed to become primary.
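
For reference, that setting would live in the startup section of
drbd.conf, along these lines (the resource name is just a placeholder):

  resource r0 {
    startup {
      wfc-timeout 172800;   # wait up to two days for the peer at boot
    }
  }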

I don't like the idea of a server waiting for a couple of days during
the boot sequence as a general rule, and even less in this particular
setup, since I moved the drbd init script earlier in the boot order,
before clvmd.
Stopping the boot sequence for two days would also mean I couldn't log
in remotely.

-- 
Federico.


