Well, this is still happening, and with any resource that sees much use at all. I moved some VMs off the problematic resource, and it started happening on the new resource. I could really use some help figuring out what in the world is going on, as this makes our DR node pretty useless.

I've been poring over the logs and have seen some troubling messages, but I don't know if they are really a problem or just normal. I see these a LOT:

kern.info<6>: Mar 16 11:22:05 openfiler2 kernel: block drbd15: conn( Ahead -> SyncSource ) pdsk( Outdated -> Inconsistent )
kern.info<6>: Mar 16 11:22:05 openfiler2 kernel: block drbd15: Began resync as SyncSource (will sync 748 KB [187 bits set]).
...
kern.warn<4>: Mar 16 11:22:12 openfiler2 kernel: block drbd15: cs:Ahead rs_left=465 > rs_total=187 (rs_failed 0)

From what I can tell, that last one seems to indicate there is more data out of sync than the driver expected. Is that correct? What causes it, and should I be concerned?

I get these quite often too, both between the two local nodes and between the local and offsite node:

kern.warn<4>: Mar 16 11:22:06 openfiler2 kernel: block drbd5: Digest mismatch, buffer modified by upper layers during write: 331859168s +4096

Sometimes these cause the connection to be dropped and re-established, and sometimes not. Here is a case where it did:

kern.warn<4>: Mar 16 11:26:06 openfiler2 kernel: block drbd15: Digest mismatch, buffer modified by upper layers during write: 330987176s +4096
kern.err<3>: Mar 16 11:26:06 openfiler2 kernel: block drbd15: meta connection shut down by peer.

I'm also seeing these:

kern.warn<4>: Mar 16 10:23:01 openfiler2 kernel: block drbd4: helper command: /sbin/drbdadm fence-peer minor-4 exit code 1 (0x100)
kern.err<3>: Mar 16 10:23:01 openfiler2 kernel: block drbd4: fence-peer helper broken, returned 1

And when the nodes get into the state where the remote node believes it is UpToDate but isn't, I see these on the local node:

kern.warn<4>: Mar 16 10:31:05 openfiler2 kernel: block drbd14: Local backing block device frozen?

I've found that if I force the network down on the remote node like this:

ifdown eth0; sleep 15; ifup eth0
drbdadm connect all

then the nodes will reconnect and not resync all the data on the drives, but that is very messy and scares the crap out of me.

Insights, thoughts, anything??

envisionrx wrote:
>
> This seems to be a problem with this one resource; the other resources
> aren't exhibiting this issue. Every time I disconnect and reconnect and
> then let it sync, it finishes and is left in this weird state. Migrating
> files off this resource while I dig more into the logs for a clue as to
> what might be up. I guess I'll rebuild the resource after I get
> everything off of it to see if that clears it up. If anyone has any
> suggestions in the meantime, please let me know!
>

--
View this message in context: http://old.nabble.com/drbd-resource-ahead---behind-problem-tp33454636p33518386.html
Sent from the DRBD - User mailing list archive at Nabble.com.
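For reference, the Ahead/Behind connection states in the first excerpt only occur when DRBD's congestion policy is enabled for the resource. A minimal sketch of what that looks like in drbd.conf, assuming a setup along these lines (the resource name and threshold values are placeholders, not taken from the post):

resource r15 {
    net {
        # When the replication send buffer fills past these thresholds,
        # DRBD stops blocking application I/O and goes Ahead/Behind instead,
        # marking the skipped writes in the bitmap for a later resync.
        on-congestion pull-ahead;
        congestion-fill 2G;
        congestion-extents 2000;
    }
}

When the congestion clears, DRBD resyncs the marked blocks, which is the "conn( Ahead -> SyncSource )" transition in the log. The "rs_left > rs_total" warning matches the reading above: more blocks were dirty at that point than the resync expected when it started.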
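The "Digest mismatch, buffer modified by upper layers during write" messages only appear when data-integrity checking is turned on in the net section: DRBD hashes each write before sending it, and if the filesystem or page cache modifies the buffer while it is in flight, the digest no longer matches on the peer and DRBD may drop the connection, which fits the "meta connection shut down by peer" line that follows. A sketch of the relevant option, assuming it is currently enabled here (resource name and algorithm are placeholders):

resource r5 {
    net {
        # Per-request checksumming of replicated writes. Buffers that are
        # legitimately rewritten in flight (swap, some filesystems, VM
        # workloads) trigger "Digest mismatch ... modified by upper layers"
        # without any real corruption on disk.
        data-integrity-alg crc32c;
    }
}

Removing or commenting out that line disables the check and usually stops the disconnect/reconnect cycles it causes, at the cost of losing the end-to-end integrity verification.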
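"fence-peer helper broken, returned 1" means the configured fencing handler exited with status 1, which is not one of the exit codes DRBD understands, so the fencing attempt is treated as failed. The handler and the policy that invokes it live in drbd.conf roughly as below; the script path is the Pacemaker helper shipped with DRBD and is an assumption about this setup, not something confirmed in the post:

resource r4 {
    disk {
        fencing resource-only;    # or resource-and-stonith
    }
    handlers {
        # Called when the peer becomes unreachable. It must exit with a
        # code DRBD recognizes (for example 4 = peer successfully outdated,
        # 7 = peer was fenced); anything else is reported as "broken".
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    }
}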
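On the ifdown/ifup workaround: bouncing the whole interface should not be needed just to force DRBD to re-handshake. Assuming the goal is only to drop and re-establish the replication link for the affected resource, a per-resource disconnect/connect cycle does the same thing without touching the NIC (the resource name below is a placeholder):

# Check current connection and disk states first
cat /proc/drbd
drbdadm cstate r15    # connection state of one resource
drbdadm dstate r15    # local/peer disk states

# Tear down and re-establish only the DRBD connection
drbdadm disconnect r15
sleep 5
drbdadm connect r15

Like the interface bounce, this only forces a new handshake; it does not explain why the peer ends up claiming UpToDate when it isn't.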