On Thu, Nov 27, 2008 at 03:24:51PM +0100, Federico Simoncelli wrote:
> On Thu, Nov 27, 2008 at 2:47 PM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
> >> So basically the problem is that my outdate-peer doesn't try to
> >> outdate the remote peer but it just fences it.
> >> To fix this behaviour I could modify the outdate-peer handler to check
> >> the DRBD_RESOURCE dstate and return the exit code 6 if local resource
> >> is not "UpToDate".
> >>
> >> What do you think? Comments are welcome.
> >
> > if you return 6,
> > drbd will (try to) outdate itself as a side effect.
>
> This was what I was trying to accomplish. Basically my idea to avoid
> split-brain is:
>
> Scenario 1:
>
> 1) both servers: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate
> 2) server 2 is correctly shut down
> 3) server 1: cs:WFConnection st:Primary/Unknown ds:UpToDate/Outdated
> 4) booting server 2 in StandAlone mode is impossible since it has Outdated data

careful.
you use your own outdate-peer handler.
so, server-1 "knows".
but does server-2 know that it is outdated?
who outdated it? when?

> Scenario 2:
>
> 1) both servers: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate
> 2) server 2 is incorrectly shut down (fence/power loss), resource
> remains in "Consistent" status (not "UpToDate")
> 3) server 1: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown
> 4) booting server 2 in StandAlone mode is impossible since the
> outdate-peer handler returns with exit code 6 when local resource is
> not "UpToDate".
>
> Basically the Consistent status will always be turned into Outdated
> because I have no way to check if the remote peer is primary. I assume
> that a peer with "Consistent" status was incorrectly shut down and
> can't become primary without manual intervention.

you can now no longer reboot a single primary,
whether cleanly or by power reset.
because that assumption is "wrong":
Both Outdated and UpToDate are sub aspects of Consistent.
if no drbd fencing policy is configured ("DontCare"?),
drbd assumes Consistent == UpToDate.
but if there is some drbd fencing policy configured,
then only the drbd handshake, or the outdate-peer handler via exit code
can disambiguate.

> If both nodes are incorrectly shut down they both end in "Consistent"
> status. At the next boot they'll both outdate their local resource and
> manual intervention is required to choose the most updated resource.
>
> What do you think? Comments are welcome.
>
> > why don't you just set a high initial wait for connection timeout?
> > wfc-timeout 172800;
> > if within two days no-one came and told me that I'm outdated,
> > and I still cannot reach the other node, I have all right to assume I'm
> > the only survivor and allowed to become primary.
>
> I don't like the idea of a server waiting for a couple of days in the
> boot sequence as a general rule and in this particular situation even
> more since I moved the drbd script early at the beginning before
> clvmd.
> Stopping the booting sequence for 2 days means I wouldn't be able to
> remotely log in.

of course network and sshd have to be up first.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
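
The objection to the proposed handler change can be made concrete with a small model. The following is only a sketch in Python, not DRBD code: the function names are hypothetical, the disk states are plain strings, and the only behaviour taken from the thread is that a handler returning exit code 6 causes drbd to (try to) outdate its own disk. It assumes, as the reply implies, that a node booting without its peer under a configured fencing policy starts out as "Consistent".

```python
# Disk states as they appear in the thread (plain strings, not a DRBD API).
UP_TO_DATE = "UpToDate"
CONSISTENT = "Consistent"
OUTDATED = "Outdated"

# From the thread: returning 6 makes drbd (try to) outdate itself.
EXIT_OUTDATE_SELF = 6


def proposed_handler_exit_code(local_dstate):
    """Hypothetical model of the proposed check.

    Returns 6 unless the local disk is UpToDate. Returns None when the
    handler would continue with its normal fencing path, which is not
    modelled here.
    """
    if local_dstate != UP_TO_DATE:
        return EXIT_OUTDATE_SELF
    return None


def dstate_after_handler(local_dstate, exit_code):
    """Local disk state after the handler ran (assumes the outdate succeeds)."""
    if exit_code == EXIT_OUTDATE_SELF:
        return OUTDATED
    return local_dstate


def boot_without_peer(dstate_at_boot):
    """Run the proposed handler once for a node that cannot see its peer."""
    code = proposed_handler_exit_code(dstate_at_boot)
    return dstate_after_handler(dstate_at_boot, code)


if __name__ == "__main__":
    # A single primary that was rebooted comes back as Consistent,
    # and the proposed check outdates it even though its data is current.
    print(boot_without_peer(CONSISTENT))
    print(boot_without_peer(UP_TO_DATE))
```

In this model every node that comes up as "Consistent" ends up "Outdated", whether its data was actually stale or not, which is the point of the remark that a single primary could no longer be rebooted.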