/ 2005-02-19 23:15:15 +1100 \ Jonathan Trott:

> The situation:
>
> drbd 0.7.10
> Fedora Core 1
> kernel 2.4.26
>
> Osacon2 is the active member of the heartbeat based cluster. A STONITH
> event is triggered by unplugging both heartbeat cables (including the
> drbd sync cable). The STONITH is successful in failing over the cluster
> and osacon1 becomes the active member of the cluster. When osacon2
> loads the drbd service during the boot process, osacon1 refuses the
> connection and goes into StandAlone cstate. This causes the cluster to
> no longer have redundancy. To fix this requires a drbdsetup /dev/drbd0
> net command to be run on osacon1 and the drbd service restarted on
> osacon2. Not a very automated process.
>
> The question:
> Why does drbd come up, error out, and get left in StandAlone cstate?
> Shouldn't the state of drbd on osacon2 be Secondary as it loads, and
> therefore not cause an error when it tries to sync with osacon1? Is
> there some way to avoid this in this scenario? It is completely
> reproducible and severely degrades the redundancy of this cluster.
>
> If the drbd sync cable is unplugged, then re-plugged a minute later,
> there are no problems re-establishing the drbd connection. The problem
> only occurs if the Primary is rebooted and, before it comes back
> online, the other node becomes the Primary.
>
> The following logs are from the event where the drbd service starts on
> osacon2 after the STONITH event.

Probably your heartbeat timeout is shorter than the DRBD timeout
(net timeout and maybe ping-int settings).

The logs will show this. You should include the relevant drbd and
heartbeat entries starting from _prior_ to unplugging the cables,
including the last drbd event before that
(drbdX: I am([PS]) <some numbers>... Peer... <some numbers>).

I suggest you increase the heartbeat timeout (deadtime) to be larger
than the drbd timeout. You may instead reduce the drbd timeout from the
default 10 seconds to something much lower, though if you make it too
small you might get connection-loss / connection-established cycles
during heavy network and IO load.

Yes, relying on timeouts is not exactly robust... but it is the best
you can do if you completely lose communications.

We plan to include some more hooks for the cluster manager to notify
drbd about certain events, and vice versa, to better synchronise their
views of the world... but that is more drbd 0.8 / heartbeat 2.0 (or
later) stuff.

	Lars Ellenberg
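
For reference, a minimal sketch of where these knobs live, assuming
drbd 0.7 drbd.conf syntax and a heartbeat 1.x ha.cf. The values and the
resource name are illustrative placeholders, not recommendations; check
your own version's defaults and units before copying anything.

    # /etc/ha.d/ha.cf (heartbeat)
    keepalive 2
    warntime  10
    deadtime  30          # declare the peer dead only after 30s;
                          # keep this LARGER than the drbd timeouts below

    # /etc/drbd.conf (drbd 0.7) -- only the net section shown
    resource drbd0 {
        net {
            timeout     60;   # drbd net timeout, in tenths of a second (6s)
            connect-int 10;   # seconds between connect retries
            ping-int    10;   # seconds between keep-alive pings
        }
        # disk, syncer and "on <host>" sections omitted
    }

The point is the inequality, not the exact numbers: heartbeat's deadtime
should stay comfortably above the interval after which drbd gives up on
its peer, so drbd has already dropped (and can cleanly re-accept) the
connection by the time heartbeat fails the resources over.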