[DRBD-user] 0.7.2 Split-Brain, unrecoverable - update-reconstruction [user error]

Lars Ellenberg Lars.Ellenberg at linbit.com
Thu Aug 19 12:23:52 CEST 2004



/ 2004-08-19 09:26:05 +0200
\ Alex Ongena:
> > and something with your init script order is not working properly.
> > it first cuts the network, while drbd is still up and running.
> > you need to *first* make a possible Primary drbd Secondary,
> > or completely stop drbd, and only *then* shutdown the network.
> 
> I know, but I tried to figure out how robust the driver was
> against 'less than normal' shutdowns, possibly caused by faulty HD's
> where the proper shutdown scripts are corrupted...
> 
> I'm _not_ simulating the _normal_ situations, but those that I
> have seen that can happen.
> 
> Nevertheless, I think it should be possible to recover from a
> split-brain situation by just using drbdsetup commands.
> Now, the only way to recover is to manually corrupt the
> drbd-meta storage.

no.
situation is:
 one node is Primary, the other is Secondary, and both go StandAlone
 because they detect a split brain that occurred earlier.

 you make the Primary one Secondary.

 now, both are Secondary, StandAlone.
 if you connect now, sync will start in *some* direction.

 if you don't want "some", but *the right* direction,
 first make the node with the good data Primary by using the --human
 flag of drbdsetup primary.
 now the node with the good data is Primary, and has the better
 generation counts. both are still StandAlone. you now connect them.
 sync will start in the right direction.
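
 as a rough sketch of that order (not a tested procedure): the device
 /dev/drbd0 and the resource name r0 are assumed example names, and
 using "drbdadm connect" for the reconnect step is an assumption as
 well. by default the script only prints what it would run.

```shell
#!/bin/sh
# sketch only: prints the commands unless DRY_RUN=0 is set.
# DEV and RES are assumed example names, not taken from this thread.
DEV="${DEV:-/dev/drbd0}"
RES="${RES:-r0}"
DRY_RUN="${DRY_RUN:-1}"

# print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# step 1, on the node that is currently Primary: make it Secondary
demote() {
    run drbdsetup "$DEV" secondary
}

# step 2, only on the node with the good data: promote with --human
promote_good() {
    run drbdsetup "$DEV" primary --human
}

# step 3, on both nodes: reconnect so they leave StandAlone
# (assumption: drbdadm connect is what you use for this)
reconnect() {
    run drbdadm connect "$RES"
}

case "${1:-}" in
    demote) demote ;;
    good)   promote_good && reconnect ;;
    other)  reconnect ;;
esac
```

 run it with "demote" on the current Primary first, then with "good"
 on the node whose data you want to keep, then with "other" on its
 peer. check the printed commands before setting DRY_RUN=0.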

> > just out of curiosity,
> > what are you using as cluster manager?
> heartbeat
> 
> > why don't you use heartbeat?
> I am

ok.

the drbd init script should not be called by heartbeat at all anymore.
it is not a resource, it starts a driver. the resource is the drbddisk!
and your logs look like you have drbd as resource.
this seems strange. but maybe I am misreading the logs.

in general, the drbddisk status and drbddisk start operations may need
to be improved to somehow account for the drbd internal "StandAlone"
connection state... I'm not yet sure how to do this properly. or rather,
there is more than one way to do this, and some prefer data
consistency, and some prefer availability...

> > > PS: let me know if I can help with further testing to make
> > > 0.7.x rock-solid.
> > 
> > for now, just use it the right way :)
> I do, but I also like drbd to be robust when used the wrong way
> (caused by script failures, errors, bad HD's, ....)

have a look at the scripts and perl modules under testing/CTH, if you
like. though these test scripts are still beta code, and cannot yet do
everything they claim to be able to do.


	Lars Ellenberg


btw, your MUA keeps sending me a private copy, too.  there is a
difference between reply, group reply, and list reply,
even in ximian evolution...
please use the "List-Reply" function of your email client.


