[DRBD-user] data loss?

Tue Aug 30 11:39:07 CEST 2005

/ 2005-08-30 07:54:47 +0800
\ Federico Sevilla III:
> On Mon, Aug 29, 2005 at 02:17:13PM -0400, Musard, Kris wrote:
> > I recently experienced some data loss with drbd.  I had a resource
> > "r0" which lost its connection about a month ago and went unnoticed.
> > This past weekend something caused both machines to reboot.  They are
> > both running heartbeat.  The machine with the older data started
> > heartbeat first and became primary.  A sync occurred causing the older
> > data to be copied to the other node.  In order to prevent this from
> > happening in the future I have put a script in place to monitor the
> > status of drbd and notify me when resources are not connected.  I also
> > set the "on-disconnect reconnect" parameter for all of my resources.
> > My question is what would have caused the older data to look newer to
> > drbd and cause the incorrect re-sync, and what additional steps can be
> > taken to prevent this from happening in the future?
> 
> At least on Debian GNU/Linux (I have not tried DRBD with the other major
> distributions, although this is probably a common trait), the DRBD
> initialization script waits for its partner to come online, or a manual
> administrative override at the console to bypass this wait, before it
> proceeds.
> 
> This process is also blocking,

but configurable, with timeouts.
I assume you are using drbd 0.7.

> which means Heartbeat and the other
> processes that are started after DRBD won't start until DRBD is done
> starting. This ensures that both machines are able to talk and agree on
> who has "newer" data, and how they should synchronize before other
> things like Heartbeat start and declare one or the other as primary.

the default timeout is: forever (wfc-timeout=0).
the default "degraded" timeout is: two minutes (degr-wfc-timeout=120).
the "degraded" timeout is used when the node had no peer before the crash.

so in your case (drbd unconnected), it would have used the "degraded"
timeout.  if the node with the "bad" data booted faster, the heartbeat
startup timeout was reached as well, and heartbeat then decided to make
this node primary, that would have increased its "timeout" counter,
which takes precedence when the nodes decide who has the newer data.

only your logs could tell whether this has in fact happened.

to avoid at least this behaviour, you could increase the
degr-wfc-timeout, or set it to 0 (forever).
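
as a rough sketch, assuming drbd 0.7 and the resource name "r0" from
the original post, the startup section of drbd.conf could look like
this. the values are only an illustration; check them against the
drbd.conf man page of your version before using them.

```
resource r0 {
  startup {
    # wait for the peer forever on a normal boot (this is the default).
    wfc-timeout       0;
    # also wait forever if the node had no peer before the crash,
    # instead of giving up after the default 120 seconds.
    degr-wfc-timeout  0;
  }
  # ... rest of the resource definition unchanged ...
}
```

note that with both timeouts at 0, a node that boots alone will sit in
the init script until its peer shows up or someone overrides the wait
at the console.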

thinking about it, maybe we should use this "degraded" timeout only
if we have been _Primary_ without peer before the crash...

> AFAIK, as soon as both DRBD nodes have begun talking and know who has
> what, it shouldn't matter who gets flagged as primary and gets mounted
> somewhere.

right, at least for 0.7 and beyond.

the problem would be that they have not been able to talk to each other.
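
for the monitoring script mentioned in the original post, here is a
minimal sketch in python. it assumes the 0.7 /proc/drbd layout, where
each device line looks like " 0: cs:Connected st:Primary/Secondary ...".
the function name and the set of states treated as healthy are my
assumptions, not something drbd ships; adjust both to your setup.

```python
import re

# assumption: connection states that mean "talking to the peer".
# anything else (WFConnection, StandAlone, ...) gets reported.
HEALTHY = {"Connected", "SyncSource", "SyncTarget"}

# matches device lines such as " 0: cs:Connected st:Primary/Secondary"
DEVICE_LINE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")


def unconnected_devices(proc_drbd_text):
    """Return (minor, connection_state) pairs for devices not in HEALTHY."""
    found = []
    for line in proc_drbd_text.splitlines():
        match = DEVICE_LINE.match(line)
        if match and match.group(2) not in HEALTHY:
            found.append((int(match.group(1)), match.group(2)))
    return found
```

run from cron, you would pass it the contents of /proc/drbd and send
mail whenever the returned list is not empty.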

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :