Sorry if my questions seemed too simplistic; I was asking for confirmation of my understanding before making my suggestion for improvement below:

> The thing that people usually realise too late is that the default
> for the wait-for-connection setting is 0, which means wait for the
> peer connection forever, or until manual intervention. Such a setting
> will block your boot process if the DRBD service is started at boot
> time.

OK, so in the case of 0 (the default), my scenario descriptions were correct, no? What happens with the drbd8 master/slave OCF resource agents, since they do not use the boot script? Do they honor the wait-for-connection setting?

> Re 2A) No, if you set a reasonable wait-for-connection timeout
> interval.

But what is reasonable? Any value you set risks split brain if it happens to be scenario 2B, as you pointed out, right? This means that the default is currently the safe solution and anything else is extremely risky, right?

> Only a node that has UpToDate data can become Primary.

But the first node that comes up in either 2A or 2B will have its data marked "UpToDate", right? So in 2A that will be accurate, but in 2B it means split brain, if "UpToDate" is trusted (hence the wait-for-connection, I suppose)?

> So when one node goes down immediately, there is no way to set its
> data as Outdated.

This is what I was trying to verify, since I do not like this part. Again, I assume that is why the default wfc-timeout is 0? Setting it to any other value seems like asking for split brain. I was wondering whether any other solutions to this problem have been sought. Would it not be possible, after node B goes down, to record on node A that node B is outdated (just because node B is unreachable does not mean that we have no valuable information about the cluster status)?
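For reference, the setting I mean is the one configured in the startup section of drbd.conf; a minimal sketch, where the resource name and layout are illustrative assumptions on my part, not something from this thread:

```
resource r0 {
  startup {
    # 0 (the default) = at boot, wait for the peer connection
    # forever, blocking the init script until the peer shows up
    # or an operator intervenes; a nonzero value is a timeout
    # in seconds after which the script gives up waiting.
    wfc-timeout 0;
  }
  # ... device/disk/net sections omitted ...
}
```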
This way, in scenario 2B nothing would change: node B would continue to wait for node A before either node is promoted. But at least in scenario 2A, node A could take note that node B was previously down (and should therefore be considered outdated), and node A could be allowed to be promoted right away (or after a newly defined timer expires) without waiting for node B to come up. If the cluster was already degraded when node A went down, it should be able to continue to operate degraded safely when node A comes back up, right? Is there anything wrong with this logic? Are there currently any mechanisms to do this?

Thanks,
-Martin
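As an aside: if I am reading the drbd.conf documentation correctly, there seems to be a separate startup timeout that applies in the degraded case I describe, i.e. when a node knows the cluster was already degraded before it went down. A sketch, with an illustrative timeout value I picked myself:

```
resource r0 {
  startup {
    wfc-timeout 0;         # normal boot: wait for the peer forever
    # Used instead of wfc-timeout when this node recorded that the
    # cluster was degraded before it went down -- here, wait only
    # 120 seconds before proceeding without the (presumed outdated)
    # peer.
    degr-wfc-timeout 120;
  }
}
```

If that understanding is right, it would at least cover part of the 2A case, though it does not by itself mark the absent peer's data as Outdated.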