[DRBD-user] Possible DRBD Desync After Outage - Why?

Maros Timko timkom at gmail.com
Tue Jul 7 21:35:31 CEST 2009


2009/7/7 Mike Sweetser - Adhost <mikesw at adhost.com>

>  Hello:
>
> We have two DRBD machines running RHEL 5.3 with DRBD 8.3.0.  Recently, we
> had an outage that took the primary server in the cluster down, leaving it
> to failover using DRBD and Heartbeat.  This was done with no issues.
>
> When the other server came back online, we initiated a manual resync as
> follows:
>
> drbdadm secondary RESOURCE
> drbdadm -- --discard-my-data connect RESOURCE
>
> Then from the live server, we did drbdadm connect RESOURCE, and it
> connected and resynced.
>
> Assuming all this was done right, we ran into other issues - some people
> have complained that their files have "reverted" to a previous state.  We
> don't show any errors occurring in the synchronization of the files, and
> never saw any "oos" in the DRBD status.
>
> So how could this have happened?  What can be done, outside of regular
> "drbdadm verify"s, to combat this problem?  And honestly, why is it
> necessary to do manual verification when file integrity of this nature
> should be a fundamental part of any file system duplication of this nature?
>

Because DRBD replicates data blocks and does not care about the filesystem
on top of it. Without a cluster-aware filesystem, it is not filesystem
duplication.
If you are using Xen on top of DRBD, some writes may not get propagated to
the standby node. Search the list archives for threads with the details.
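
Since replication happens at the block level, the place to look for
divergence is DRBD's own out-of-sync counter rather than the filesystem.
A minimal sketch, assuming the 8.3-style /proc/drbd layout where each
device reports an "oos:" field in KiB; the function name drbd_oos_total
is made up for illustration:

```shell
# Sum the "oos:" (out-of-sync, KiB) counters in /proc/drbd-style text.
# Reads the file given as $1, defaulting to /proc/drbd.
# Assumption: fields look like "oos:1234", as in DRBD 8.3 status output.
drbd_oos_total() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^oos:[0-9]+$/)
                total += substr($i, 5)   # strip the "oos:" prefix
    } END { print total + 0 }' "${1:-/proc/drbd}"
}
```

Run it after an online verify ("drbdadm verify r1", which needs a
verify-alg set in the net section) has finished. A non-zero total means
blocks differ between the nodes; as far as I know, disconnecting and
reconnecting the resource then resyncs the marked blocks.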

>
> I've attached my drbd.conf here - feel free to mention if I've done
> something stupid.
>
> resource r1 {
>   protocol C;
>   handlers {
>     pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' |
> wall; /etc/init.d/heartbeat stop"; #"halt -f";
>     pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall;
> /etc/init.d/heartbeat stop"; #"halt -f";
>   }
>
>   startup {
>     degr-wfc-timeout 120;    # 2 minutes.
>     wfc-timeout 120;    # 2 minutes.
>   }
>
>   disk {
>     no-disk-flushes;
>
This looks interesting. Are you sure you have a battery-backed write cache?

Tino