[Drbd-dev] Re: drbd question about secondary vs primary; drbddisk status problem

Tue Aug 24 00:01:47 CEST 2004

On 2004-08-20T14:52:52,
   Philipp Reisner <philipp.reisner at linbit.com> said:

> The situation:
> 
>  N1    N2
>  P --- S   Everything ok.
>  P - - S   Link breaks.
>  P - - P   A (also split-brained) Cluster-mgr makes N2 primary too.

Big fat bug in the setup and in the cluster manager. ;-) Thus, while it
must be resolvable, it doesn't need to be resolved efficiently.

>  X     X   Both nodes down.
>  P --- S   The current behaviour. 
> 
> What should be done after Split brain ? 

Both sides should detect this and by default refuse to connect until a
human (or a higher-up being such as the cluster manager) intervenes and
explicitly and forcefully demotes one side to secondary again.
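
A minimal sketch of that default, in Python rather than anything drbd
actually contains. The names NodeState, was_primary_while_disconnected,
explicitly_demoted and handshake are made up for illustration; how a
node records that it was primary while disconnected is assumed, not
specified here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class NodeState:
    name: str
    # True if this node acted as primary while the link was down.
    was_primary_while_disconnected: bool
    # Hypothetical flag, set only by an admin or the cluster manager.
    explicitly_demoted: bool = False


def handshake(a: NodeState, b: NodeState) -> Tuple[str, Optional[str]]:
    """Decide what happens when the two nodes see each other again.

    Returns ("connect", sync_source_name) or ("refuse", None).
    """
    split_brain = (a.was_primary_while_disconnected
                   and b.was_primary_while_disconnected)
    if not split_brain:
        # No divergence: the normal resync rules apply (not modelled here).
        return ("connect", None)
    # Both sides diverged. Only an explicit demotion of exactly one side
    # unblocks the connection; the other side becomes the sync source.
    if a.explicitly_demoted != b.explicitly_demoted:
        source = b if a.explicitly_demoted else a
        return ("connect", source.name)
    return ("refuse", None)
```

Note that the check is on what happened during the disconnect, not on
the current role, so it still refuses after both nodes were rebooted.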

> The questions are:
> Should this policy be configurable ? (IMO: yes)
> Which policies do we want to offer ?
> 
>  * The node that was primary before split brain (current behaviour)
>  * The node that became primary during split brain 
>  * The node that modified more of its data during the split-brain
>    situation  [ Do not think about implementation yet, just about
>                 the policy ]
>  * others ?...

See above. None of your three choices seems the safe answer, because it
will need an admin to sort out which side really has the 'better' data,
or even worse, may require an image to be taken of both sides and the
changes merged.

> The second question to answer is:
> What should we do if the connecting network heals ? I.e.
> 
>  N1    N2
>  P --- S   Everything ok.
>  P - - S   Link breaks.
>  P - - P   A (also split-brained) Cluster-mgr makes N2 primary too.

(Comment about broken setup applies again.)

>  ? --- ?   What now ?
> 
> Current policy: The two nodes will refuse to connect. The administrator
>                 has to resolve this.
> 
> Are there any other policies that would make sense ?

This is the best solution I can think of for the above reasons. As there
may be higher level services running on both nodes, you can't
(internally to drbd) resolve this. The higher level services need to be
stopped, and one side explicitly demoted. Or both demoted and one
explicitly promoted, which should come out the same.
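
To illustrate why the two orders come out the same, here is a small
sketch in Python. It is a toy model, not drbd code: the role strings,
the running_services set and the demote/promote helpers are all
assumptions made up for this example.

```python
from typing import Dict, Set

Roles = Dict[str, str]  # node name -> "primary" or "secondary"


def demote(roles: Roles, node: str, running_services: Set[str]) -> Roles:
    """Demote one node; refuse while higher-level services still run on it."""
    if node in running_services:
        raise RuntimeError(f"stop services on {node} before demoting it")
    new_roles = dict(roles)
    new_roles[node] = "secondary"
    return new_roles


def promote(roles: Roles, node: str) -> Roles:
    """Promote one node; refuse if another node is still primary."""
    if any(r == "primary" for n, r in roles.items() if n != node):
        raise RuntimeError("another node is still primary")
    new_roles = dict(roles)
    new_roles[node] = "primary"
    return new_roles


split = {"N1": "primary", "N2": "primary"}

# Path 1: stop services on N2, then demote only N2.
path1 = demote(split, "N2", running_services={"N1"})

# Path 2: stop services everywhere, demote both, then promote N1.
path2 = demote(demote(split, "N1", set()), "N2", set())
path2 = promote(path2, "N1")

assert path1 == path2 == {"N1": "primary", "N2": "secondary"}
```

Either way the decision about which node keeps its data is made
outside drbd, by whoever stops the services and issues the demotion.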

Kind regards,
    Lars Marowsky-Brée <lmb at suse.de>

-- 
High Availability & Clustering	    \        This space          /
SUSE Labs, Research and Development |       intentionally        |
SUSE LINUX AG - A Novell company    \        left blank          /