[DRBD-user] UUIDs not being rotated after completed sync

Fri Nov 18 21:32:48 CET 2011

Lars,

note that the following issue description is from a DRBD setup that is
well outside of design specs. This is from an IRC discussion with Paul
Hedderly (CC'd, "prh" on freenode). It's a Debian sid box running the
8.3.9 userland against the stock 3.1 kernel drbd.ko (equiv. DRBD 8.3.11),
with protocol A over a DSL link, where the user has said he doesn't want
to pay for DRBD Proxy. So it's pretty much a symphony of "don't do that",
but I'm posting the issue here in case there's a deeper underlying
problem that you want to investigate. Please, by all means, feel free to
pass on this if you're busy.

This is the DRBD config for his misbehaving device, /dev/drbd12:

disk {
	size			0s _is_default; # bytes
	on-io-error		detach;
	fencing 		dont-care _is_default;
	max-bio-bvecs		0 _is_default;
}
net {
	timeout 		150; # 1/10 seconds
	max-epoch-size		2048 _is_default;
	max-buffers		1024;
	unplug-watermark	128 _is_default;
	connect-int		20; # seconds
	ping-int		20; # seconds
	sndbuf-size		0 _is_default; # bytes
	rcvbuf-size		0 _is_default; # bytes
	ko-count		0 _is_default;
	after-sb-0pri		discard-least-changes;
	after-sb-1pri		discard-secondary;
	after-sb-2pri		disconnect _is_default;
	rr-conflict		disconnect _is_default;
	ping-timeout		20; # 1/10 seconds
}
syncer {
	rate			250k _is_default; # bytes/second
	after			11;
	al-extents		127 _is_default;
	csums-alg		"sha1";
	verify-alg		"sha1";
	on-no-data-accessible	io-error _is_default;
	c-plan-ahead		30; # 1/10 seconds
	c-delay-target		10 _is_default; # 1/10 seconds
	c-fill-target		32s; # bytes
	c-max-rate		800k; # bytes/second
	c-min-rate		1024k; # bytes/second
}
protocol A;
_this_host {
	device			minor 12;
	disk			"/dev/mew88/drbd_x2";
	flexible-meta-disk	"/dev/mew88/dmeta_x2";
	address 		ipv4 10.88.8.167:2782;
}
_remote_host {
	address 		ipv4 10.88.2.2:7782;
}
# (81)	    unknown tag = (integer) 0	[len: 4]
# (82)	    unknown tag = (integer) 0	[len: 4]
# (83)	    unknown tag = (integer) 127 	[len: 4]
# Found unknown tags, you should update your
# userland tools

Now, what he is reporting is that any time that device gets
disconnected, it does a full sync on reconnect. And judging by the logs,
that does indeed seem to be true:

Nov 18 04:52:39 mew88 kernel: [167671.867930] block drbd12: drbd_sync_handshake:
Nov 18 04:52:39 mew88 kernel: [167671.867945] block drbd12: self 002A000000000000:0000000000000000:0000000000000000:0000000000000000 bits:1888384 flags:0
Nov 18 04:52:39 mew88 kernel: [167671.867959] block drbd12: peer 0000000000000005:002A000000000000:0029000000000000:0028000000000000 bits:1888584 flags:2
Nov 18 04:52:39 mew88 kernel: [167671.867971] block drbd12: uuid_compare()=2 by rule 30
Nov 18 04:52:39 mew88 kernel: [167671.867978] block drbd12: Becoming sync target due to disk states.

The sync decision above, AFAICT, is fine based on the UUIDs. The
sync then actually does complete, hours later, which again is expected
given the amazingly slow link:

Nov 18 04:52:41 mew88 kernel: [167673.294137] block drbd12: Began resync as SyncTarget (will sync 20971520 KB [5242880 bits set]).
Nov 18 11:21:52 mew88 kernel: [191024.287840] block drbd12: Resync done (total 23350 sec; paused 0 sec; 896 K/sec)
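As a quick cross-check of those figures, here is a small Python sketch. It assumes one bitmap bit covers 4 KiB (which is what 20971520 KB over 5242880 bits implies), and it assumes the reported rate is an integer bits-per-second value scaled by 4; that rounding order is my guess, it merely happens to reproduce the logged 896 K/sec. The function names are mine, not DRBD's.

```python
# Cross-check of the resync figures from the log lines above.
# Assumption: one bitmap bit tracks 4 KiB of device data.
KB_PER_BIT = 4


def bits_to_kb(bits: int) -> int:
    """Amount of data (in KB) represented by a number of set bitmap bits."""
    return bits * KB_PER_BIT


def hms_to_seconds(hms: str) -> int:
    """Convert a syslog HH:MM:SS time of day to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def avg_rate_kb(bits: int, seconds: int) -> int:
    """Average rate in KB/s, rounding down to whole bits per second first
    (assumed rounding order, see the note above)."""
    return (bits // seconds) * KB_PER_BIT


total_kb = bits_to_kb(5242880)
wall_clock = hms_to_seconds("11:21:52") - hms_to_seconds("04:52:41")
rate = avg_rate_kb(5242880, 23350)

print(f"to sync: {total_kb} KB, wall clock: {wall_clock} s, rate: {rate} K/sec")
```

So the whole 20 GiB device was scheduled for resync, and the duration and rate in the log are consistent with the syslog timestamps.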

But then, immediately after that, this:

Nov 18 11:21:52 mew88 kernel: [191024.287856] block drbd12: 65 % had equal checksums, eliminated: 13698048K; transferred 7273472K total 20971520K
Nov 18 11:21:52 mew88 kernel: [191024.287875] block drbd12: updated UUIDs 0000000000000004:0000000000000000:002B000000000000:002A000000000000

So if I understand correctly, it's setting the current UUID to
UUID_JUST_CREATED (4). Hrm.
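To make that reading explicit, a minimal Python sketch that splits the logged UUID set into its slots and tests the current UUID. The slot order (current, bitmap, two history entries) and the value 4 for UUID_JUST_CREATED are my understanding of the 8.3 sources, and masking the lowest bit assumes it is a flag rather than part of the identifier. The helper names are made up for illustration.

```python
# Decode a "self"/"peer"/"updated UUIDs" field from the kernel log.
UUID_JUST_CREATED = 0x4  # assumed value of the kernel constant
SLOTS = ("current", "bitmap", "history1", "history2")  # assumed slot order


def parse_uuids(field: str) -> dict:
    """Split 'AAAA:BBBB:CCCC:DDDD' into named 64-bit integers."""
    values = [int(part, 16) for part in field.split(":")]
    if len(values) != len(SLOTS):
        raise ValueError(f"expected {len(SLOTS)} UUIDs, got {len(values)}")
    return dict(zip(SLOTS, values))


def is_just_created(current: int) -> bool:
    """True if the current UUID equals UUID_JUST_CREATED, ignoring bit 0."""
    return (current & ~1) == UUID_JUST_CREATED


after_sync = parse_uuids(
    "0000000000000004:0000000000000000:002B000000000000:002A000000000000"
)
print(is_just_created(after_sync["current"]))
```

By the same test, the peer's current UUID from the first handshake, 0000000000000005, also counts as just created once the lowest bit is masked.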

The link then breaks again, a few hours later:

Nov 18 15:06:29 mew88 kernel: [204501.759578] block drbd12: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Nov 18 15:06:29 mew88 kernel: [204501.770208] block drbd12: meta connection shut down by peer.

And when it is reinstated a couple of minutes later, sure enough, a
full sync is requested again, this time by rule 20:

Nov 18 15:08:50 mew88 kernel: [204642.358228] block drbd12: drbd_sync_handshake:
Nov 18 15:08:50 mew88 kernel: [204642.358243] block drbd12: self 0000000000000004:0000000000000000:002B000000000000:002A000000000000 bits:0 flags:0
Nov 18 15:08:50 mew88 kernel: [204642.358257] block drbd12: peer 38A8B488B7060CB3:0000000000000005:002B000000000000:002A000000000000 bits:573 flags:0
Nov 18 15:08:50 mew88 kernel: [204642.358269] block drbd12: uuid_compare()=-2 by rule 20
Nov 18 15:08:50 mew88 kernel: [204642.358277] block drbd12: Writing the whole bitmap, full sync required after drbd_sync_handshake.
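For reference, both handshakes can be replayed against a rough Python sketch of the first three comparison rules as I read them in the 8.3 sources. This is my paraphrase, not the kernel code: the conditions, the value 4 for UUID_JUST_CREATED and the masking of the lowest bit are all assumptions, and the later rules are left out entirely.

```python
# Paraphrase of rules 10, 20 and 30 of the UUID comparison (assumed logic).
UUID_JUST_CREATED = 0x4  # assumed value of the kernel constant


def first_rules(self_current: int, peer_current: int):
    """Return (result, rule_nr) if one of rules 10-30 applies, else None.

    A result of +2 / -2 means a full sync with this node as source / target.
    """
    me = self_current & ~1    # ignore the lowest bit (assumed flag bit)
    peer = peer_current & ~1
    if me == UUID_JUST_CREATED and peer == UUID_JUST_CREATED:
        return (0, 10)
    if me in (UUID_JUST_CREATED, 0) and peer != UUID_JUST_CREATED:
        return (-2, 20)
    if me != UUID_JUST_CREATED and peer in (UUID_JUST_CREATED, 0):
        return (2, 30)
    return None  # would fall through to the later rules, not modelled here


morning = first_rules(0x002A000000000000, 0x0000000000000005)
afternoon = first_rules(0x0000000000000004, 0x38A8B488B7060CB3)
print(morning, afternoon)
```

Under those assumptions the sketch lands on rule 30 for the 04:52 handshake and rule 20 for the 15:08 one, matching what the kernel logged in both cases.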

Da capo al fine.

Is there a plausible explanation for that odd current UUID? I have a
full kernel log if that is helpful.

Cheers,
Florian
