On Thu, Nov 27, 2008 at 2:47 PM, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
>> So basically the problem is that my outdate-peer doesn't try to
>> outdate the remote peer but it just fences it.
>> To fix this behaviour I could modify the outdate-peer handler to check
>> the DRBD_RESOURCE dstate and return the exit code 6 if local resource
>> is not "UpToDate".
>>
>> What do you think? Comments are welcome.
>
> if you return 6,
> drbd will (try to) outdate itself as a side effect.

This was what I was trying to accomplish.
Basically my idea to avoid split-brain is:

Scenario 1:
1) both servers: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate
2) server 2 is correctly shut down
3) server 1: cs:WFConnection st:Primary/Unknown ds:UpToDate/Outdated
4) booting server 2 in StandAlone mode is impossible since it has
   Outdated data

Scenario 2:
1) both servers: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate
2) server 2 is incorrectly shut down (fence/power loss), resource
   remains in "Consistent" status (not "UpToDate")
3) server 1: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown
4) booting server 2 in StandAlone mode is impossible since the
   outdate-peer handler returns with exit code 6 when local resource is
   not "UpToDate".

Basically the Consistent status will always be turned into Outdated
because I have no way to check if the remote peer is primary.
I assume that a peer with "Consistent" status was incorrectly shut down
and can't become primary without manual intervention.

If both nodes are incorrectly shut down they both end in "Consistent"
status. At the next boot they'll both outdate their local resource and
manual intervention is required to choose the most updated resource.

What do you think? Comments are welcome.

> why don't you just set a high initial wait for connection timeout?
> wfc-timeout 172800;
> if within two days no-one came and told me that I'm outdated,
> and I still cannot reach the other node, I have all right to assume I'm
> the only survivor and allowed to become primary.

I don't like the idea of a server waiting for a couple of days in the
boot sequence as a general rule and in this particular situation even
more since I moved the drbd script early at the beginning before clvmd.
Stopping the booting sequence for 2 days means I wouldn't be able to
remotely log in.

--
Federico.
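
For illustration, the handler change proposed earlier in this message (return exit code 6 whenever the local dstate is not "UpToDate") could be sketched roughly as below. This is only a sketch under assumptions, not the actual handler from this thread: the function names local_dstate and handler_exit_code and the FENCE_OK_CODE variable are hypothetical, and the default of 7 for a successful fence is assumed here and should be checked against the handler exit codes of the DRBD version in use.

```shell
#!/bin/sh
# Sketch only: decision logic for an outdate-peer handler that refuses to
# proceed when the local disk is not UpToDate. All names are hypothetical.

# Extract the local disk state from a "local/peer" pair,
# e.g. "Consistent/DUnknown" -> "Consistent".
local_dstate() {
    printf '%s\n' "${1%%/*}"
}

# Print the exit code the handler would return for a given dstate pair.
# 6 is the code discussed in this thread. FENCE_OK_CODE stands in for
# whatever the existing fencing path returns on success (assumed: 7).
handler_exit_code() {
    if [ "$(local_dstate "$1")" != "UpToDate" ]; then
        echo 6
    else
        echo "${FENCE_OK_CODE:-7}"
    fi
}

# In a real handler this would be wired up along these lines, assuming
# DRBD exports DRBD_RESOURCE to the handler and `drbdadm dstate` prints
# the "local/peer" pair:
#   exit "$(handler_exit_code "$(drbdadm dstate "$DRBD_RESOURCE")")"
```

With this logic, Scenario 2 above (local dstate "Consistent") yields 6, while a node that is still "UpToDate" falls through to the existing fencing path.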