[DRBD-user] split-brain on Ubuntu 14.04 LTS after reboot of master node

Sun Nov 15 20:12:57 CET 2015

On 15/11/15 02:07 PM, Ivan wrote:
> 
> 
> On 11/15/2015 07:53 PM, Digimer wrote:
>> On 15/11/15 11:36 AM, Ivan wrote:
>>>
>>>
>>> On 11/15/2015 05:04 PM, Digimer wrote:
>>>> On 15/11/15 05:03 AM, Waldemar Brodkorb wrote:
>>>>> Hi,
>>>>> Digimer wrote,
>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>>           dc-version="1.1.10-42f2063" \
>>>>>>>           cluster-infrastructure="corosync" \
>>>>>>>           stonith-enabled="false" \
>>>>>>
>>>>>> And here's the core of the problem.
>>>>>>
>>>>>> Configure and test stonith in pacemaker. Then, configure drbd to use
>>>>>> 'fencing resource-and-stonith;' and configure
>>>>>> 'crm-{un,}fence-peer.sh' as
>>>>>> the {un,}fence handlers.
>>>>>
>>>>> So stonith is a hard requirement even when a simple reboot is done?
>>>>>
>>>>> The docs mention the following:
>>>>> "The ocf:linbit:drbd OCF resource agent provides Master/Slave
>>>>> capability, allowing Pacemaker to start and monitor the DRBD
>>>>> resource on multiple nodes and promoting and demoting as needed. You
>>>>> must, however, understand that the drbd RA disconnects and detaches
>>>>> all DRBD resources it manages on Pacemaker shutdown, and also upon
>>>>> enabling standby mode for a node."
>>>>>
>>>>> http://drbd.linbit.com/users-guide-8.4/s-pacemaker-crm-drbd-backed-service.html
>>>>>
>>>>>
>>>>>
>>>>> So why does demoting not work when a reboot is done?
>>>>> When I do a simple crm node standby; sleep 30; crm node online
>>>>> everything is fine.
>>>>>
>>>>> best regards
>>>>>    Waldemar
>>>>
>>>> It's a hard requirement, period. Without it, debugging problems is a
>>>> waste of time because the cluster enters an undefined state. Fix
>>>> stonith, see if the issue remains, and if so, let us know.
>>>
>>> You're right that fencing should be set up for production clusters (in
>>> the sense that you take a huge data consistency risk not setting it up)
>>> but last time I did a test environment without stonith I could reboot a
>>> node without getting a pacemaker split-brain. Either things have changed
>>> from back then, or the OP is hitting another problem; maybe the reboot
>>> doesn't properly shut down pacemaker, or the network (link, firewall,
>>> ...) is torn down before pacemaker is stopped, ...
>>>
>>> cheers
>>> ivan
>>
>> It is entirely possible the issue is not related to fencing being
>> disabled. My point is that, without fencing, debugging becomes
>> sufficiently more complicated that it's not worth it. With fencing,
>> problems become much easier to find, in my experience.
>>
>> Also, you need fencing in all cases anyway, so why not use it from the
>> start? A "test cluster" that doesn't match what will become production
>> has limited use, wouldn't you agree?
>>
> 
> I fully agree. But a proper cluster setup is complicated enough that
> most people want to set up things step by step, and understand each step
> before moving on to the next one. In that case, stonith is a very
> important feature, but that's it, a feature, not a hard requirement,
> hence the existence of the stonith-enabled parameter. For instance the
> guys at clusterlabs disable stonith in their getting started doc [1]
> although there's a big warning explaining why.

I understand the rationale and disagree with it, strongly. The build
order should be:

1. Initial node config
2. Stonith
3. everything else

Stonith is not a feature, it is a requirement of any sane setup, full stop.

> Quoting the fine documentation:
> 
> "stonith-enabled=false tells the cluster to simply pretend that failed
> nodes are safely powered off".

You can drive your car without seatbelts, too. You can pull the fuse on
your airbag system. Your car still moves.

I have yet to see a use-case where disabling stonith is a valid setup.

> So if the OP is sure that its node has rebooted and doesn't access the
> shared storage (if any), then there must be a bug or a
> configuration/setup problem that fencing will just paper over when it'll
> kill the node. If enabling fencing solves the problem, that would be a
> bug too IMO. That said, you have way more experience than me, and maybe
> fencing will help in finding the cause of the problem.

Stonith is NOT only valid with shared storage, it is simply *more*
important there. If you don't need to coordinate actions between nodes,
you don't need HA; just run everything on all nodes and be done with it.
If that isn't possible (and it's not, even for "just a vip"), then you
need stonith.

> [1]
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/ch05.html

Yes, and the author indicated that the only reason it was disabled was
because he didn't want to get into a long discussion of all the
different methods of fencing. You will note though that there is a note
indicating how important stonith is, but few heed that.

I stand by my statement: debugging without stonith *tested* and
*working* is a waste of time.
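
For reference, the DRBD side of the fencing setup suggested earlier in
the thread could look roughly like the fragment below. It is a minimal
sketch, assuming DRBD 8.4 syntax; the resource name "r0" is
hypothetical, and the handler script paths are the usual install
location but may differ on your distribution. It only makes sense once
a stonith device is configured and tested in pacemaker and
stonith-enabled is set to "true".

```
resource r0 {                      # "r0" is a placeholder resource name
  disk {
    # On loss of the peer: freeze I/O and call the fence-peer handler
    fencing resource-and-stonith;
  }
  handlers {
    # Script paths are an assumption; check where your package puts them
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

The fence handler places a constraint in the pacemaker CIB that keeps
the outdated peer from being promoted, and the unfence handler removes
it again after the resync has finished.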

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?