On 15/11/15 02:07 PM, Ivan wrote:
>
>
> On 11/15/2015 07:53 PM, Digimer wrote:
>> On 15/11/15 11:36 AM, Ivan wrote:
>>>
>>>
>>> On 11/15/2015 05:04 PM, Digimer wrote:
>>>> On 15/11/15 05:03 AM, Waldemar Brodkorb wrote:
>>>>> Hi,
>>>>> Digimer wrote,
>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>> dc-version="1.1.10-42f2063" \
>>>>>>> cluster-infrastructure="corosync" \
>>>>>>> stonith-enabled="false" \
>>>>>>
>>>>>> And here's the core of the problem.
>>>>>>
>>>>>> Configure and test stonith in pacemaker. Then, configure drbd to use
>>>>>> 'fencing resource-and-stonith;' and configure
>>>>>> 'crm-{un,}fence-peer.sh as
>>>>>> the {un,}fence handlers.
>>>>>
>>>>> So stonith is a hard requirement even when a simple reboot is done?
>>>>>
>>>>> The docs is mentioning following:
>>>>> "The ocf:linbit:drbd OCF resource agent provides Master/Slave
>>>>> capability, allowing Pacemaker to start and monitor the DRBD
>>>>> resource on multiple nodes and promoting and demoting as needed. You
>>>>> must, however, understand that the drbd RA disconnects and detaches
>>>>> all DRBD resources it manages on Pacemaker shutdown, and also upon
>>>>> enabling standby mode for a node."
>>>>>
>>>>> http://drbd.linbit.com/users-guide-8.4/s-pacemaker-crm-drbd-backed-service.html
>>>>>
>>>>> So why demoting does not work when a reboot is done?
>>>>> When I do a simple crm node standby; sleep 30; crm node online
>>>>> everything is fine.
>>>>>
>>>>> best regards
>>>>> Waldemar
>>>>
>>>> It's a hard requirement, period. Without it, debugging problems is a
>>>> waste of time because the cluster enters an undefined state. Fix
>>>> stonith, see if the issue remains, and if so, let us know.
>>>
>>> You're right that fencing should be set up for production clusters (in
>>> the sense that you take a huge data consistency risk not setting it up)
>>> but last time I did a test environment without stonith I could reboot a
>>> node without getting a pacemaker split-brain. Either things have changed
>>> from back then, or the OP is hitting another problem; maybe the reboot
>>> doesn't properly shut down pacemaker, or the network (link, firewall,
>>> ...) is torn down before pacemaker is stopped, ...
>>>
>>> cheers
>>> ivan
>>
>> It is entirely possible the issue is not related to fencing being
>> disabled. My point is that, without fencing, debugging becomes
>> sufficiently more complicated that it's not worth it. With fencing,
>> problems become much easier to find, in my experience.
>>
>> Also, you need fencing in all cases anyway, so why not use it from the
>> start? A "test cluster" that doesn't match what will become production
>> has limited use, wouldn't you agree?
>>
>
> I fully agree. But a proper cluster setup is complicated enough that
> most people want to set up things step by step, and understand each step
> before moving on to the next one. In that case, stonith is a very
> important feature, but that's it, a feature, not a hard requirement,
> hence the existence of the stonith-enabled parameter. For instance the
> guys at clusterlabs disable stonith in their getting started doc [1]
> although there's a big warning explaining why.

I understand the rationale and disagree with it, strongly. The build
order should be:

1. Initial node config
2. Stonith
3. Everything else

Stonith is not a feature, it is a requirement of any sane setup, full stop.

> Quoting the fine documentation:
>
> "stonith-enable=false tells the cluster to simply pretend that failed
> nodes are safely powered off".

You can drive your car without seatbelts, too. You can pull the fuse on
your airbag system. Your car still moves. I have yet to see a use-case
where disabling stonith is a valid setup.

> So if the OP is sure that its node has rebooted and doesn't access the
> shared storage (if any), then there must be a bug or a
> configuration/setup problem that fencing will just paper over when it'll
> kill the node. If enabling fencing solves the problem, that would be a
> bug too IMO. That said, you have way more experience than me, and maybe
> fencing will help in finding the cause of the problem.

Stonith is NOT only valid with shared storage, it is simply *more*
important. If you don't need to coordinate actions between nodes, you
don't need HA; just run everything on all nodes and be done with it. If
that isn't possible (and it's not, even for "just a vip"), then you need
stonith.

> [1]
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/ch05.html

Yes, and the author indicated that the only reason it was disabled was
because he didn't want to get into a long discussion of all the
different methods of fencing. You will note though that there is a note
indicating how important stonith is, but few heed that.

I stand by my statement: debugging without stonith *tested* and
*working* is a waste of time.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
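
As a rough illustration of the advice quoted earlier in the thread (configure DRBD with 'fencing resource-and-stonith;' and the crm-{un,}fence-peer.sh handlers), a DRBD 8.4 resource fragment might look like the sketch below. The resource name r0 and the /usr/lib/drbd/ script paths are assumptions and do not come from the thread; check where your distribution installs the scripts. It only makes sense once stonith itself is configured, tested, and enabled in pacemaker.

```
# Sketch only: resource name and script paths are assumed, adjust to your setup.
resource r0 {
  disk {
    # On loss of the peer, freeze I/O and call the fence-peer handler.
    fencing resource-and-stonith;
  }
  handlers {
    # Places a constraint in the CIB so the outdated peer cannot be promoted.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # Removes that constraint again once the peer has finished resyncing.
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # ... rest of the existing resource definition stays unchanged ...
}
```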