<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7654.12">
<TITLE>Possible DRBD Desync After Outage - Why?</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<P><FONT SIZE=2>Hello:<BR>
<BR>
We have two DRBD machines running RHEL 5.3 with DRBD 8.3.0. Recently, an outage took down the primary server in the cluster, and the secondary took over via DRBD and Heartbeat. The failover itself went through with no issues. <BR>
<BR>
When the failed server came back online, we initiated a manual resync by running the following on that node:<BR>
<BR>
drbdadm secondary RESOURCE<BR>
drbdadm -- --discard-my-data connect RESOURCE<BR>
<BR>
Then from the live server, we did drbdadm connect RESOURCE, and it connected and resynced.<BR>
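For reference, here is the whole rejoin sequence as a shell sketch (the resource name r1 matches the drbd.conf below; DRBD 8.3 command syntax; the RUN=echo guard is my addition so the script only prints the commands by default):<BR>

```shell
#!/bin/sh
# Rejoin procedure after the failed node comes back (DRBD 8.3 syntax;
# resource "r1" matches the drbd.conf below). RUN defaults to "echo",
# so the script only prints the commands; set RUN= to actually run them.
RES=r1
RUN=${RUN-echo}

# On the node that was down - demote it and discard its data,
# so it resyncs from the surviving peer:
$RUN drbdadm secondary "$RES"
$RUN drbdadm -- --discard-my-data connect "$RES"

# On the surviving (live) primary - reconnect so the resync starts:
$RUN drbdadm connect "$RES"
```
<BR>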
<BR>
Assuming all of this was done correctly, we still ran into other issues - some people have complained that their files have "reverted" to a previous state. We don't see any errors occurring in the synchronization of the files, and we never saw any "oos" in the DRBD status. <BR>
<BR>
So how could this have happened? What can be done, outside of regular "drbdadm verify" runs, to combat this problem? And honestly, why is manual verification necessary at all, when this kind of integrity checking should be a fundamental part of any block-level replication?<BR>
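To make the periodic check concrete, here is a sketch of what we could script around "drbdadm verify": after the verify completes, parse the resource's line in /proc/drbd for a non-zero "oos:" counter (the sample line below is illustrative; the field layout is from the 8.3-style status output, where oos is reported in KiB):<BR>

```shell
#!/bin/sh
# Check a /proc/drbd status line for out-of-sync blocks after a verify run.
# A non-zero "oos:" counter means silently diverged data was found.
check_oos() {
    # $1 = the /proc/drbd status text for one resource
    oos=$(printf '%s\n' "$1" | sed -n 's/.*oos:\([0-9][0-9]*\).*/\1/p')
    if [ "${oos:-0}" -gt 0 ]; then
        echo "OUT OF SYNC: $oos KiB"
    else
        echo "in sync"
    fi
}

# Illustrative sample in the 8.3 /proc/drbd format:
sample=" 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0"
check_oos "$sample"
```

Something along these lines could run from a weekly cron job ("drbdadm verify r1", then the check against /proc/drbd) - though as I understand it, verify only detects divergence after the fact; it does not prevent it.<BR>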
<BR>
I've attached my drbd.conf here - feel free to point out if I've done something stupid.<BR>
<BR>
resource r1 {<BR>
protocol C;<BR>
handlers {<BR>
pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";<BR>
pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";<BR>
}<BR>
<BR>
startup {<BR>
degr-wfc-timeout 120; # 2 minutes.<BR>
wfc-timeout 120; # 2 minutes.<BR>
}<BR>
<BR>
disk {<BR>
no-disk-flushes;<BR>
on-io-error detach;<BR>
}<BR>
<BR>
net {<BR>
timeout 120;<BR>
connect-int 20;<BR>
ping-int 20;<BR>
max-buffers 2048;<BR>
max-epoch-size 2048;<BR>
ko-count 30;<BR>
cram-hmac-alg "sha1";<BR>
shared-secret "******";<BR>
after-sb-0pri disconnect;<BR>
after-sb-1pri disconnect;<BR>
after-sb-2pri disconnect;<BR>
}<BR>
<BR>
syncer {<BR>
rate 30M;<BR>
al-extents 503;<BR>
verify-alg sha1;<BR>
}<BR>
<BR>
on SERVER1 {<BR>
device /dev/drbd1;<BR>
disk /dev/sda7;<BR>
address 172.16.2.1:7789;<BR>
meta-disk /dev/sda6[1];<BR>
}<BR>
<BR>
on SERVER2 {<BR>
device /dev/drbd1;<BR>
disk /dev/sda7;<BR>
address 172.16.2.2:7789;<BR>
meta-disk /dev/sda6[1];<BR>
}<BR>
}</FONT>
</P>
</BODY>
</HTML>