[DRBD-user] DRBD fails to bring resources up with ~90 resources operational

Tue Mar 1 16:10:40 CET 2016

Hello,

I have a pair of Xen hosts, and the guests are Pacemaker resources, each
of which has its own underlying DRBD storage device. I am running DRBD
version 8.4.5 [0], with "minor-count 200;" set in global_common.conf.[1]

Last night, attempting to create a new DRBD resource failed, the error
being (having passed through a few layers of reporting):

"drbd.d/mws-priv-95.res:7: in resource mws-priv-95, on agogue:
	IP fd19:1b70:f7a6:1ae5::8d:6 not found on this host."

This is incorrect, and I can verify that

fd191b70f7a61ae500000000008d0006 03 40 00 80 eth1

appears in /proc/net/if_inet6. But perhaps this is all a red herring?
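
For anyone wanting to repeat that check, the two notations can be compared
mechanically. What follows is only a minimal Python sketch: the helper names
are made up for illustration, it works on text copied from
/proc/net/if_inet6 rather than reading the live file, and it says nothing
about how drbdadm itself performs the lookup.

```python
import ipaddress


def if_inet6_form(addr):
    """Return addr as the 32 hex digits used in /proc/net/if_inet6."""
    return ipaddress.IPv6Address(addr).packed.hex()


def find_in_if_inet6(addr, table):
    """Return the device names whose entry in `table` matches addr."""
    wanted = if_inet6_form(addr)
    hits = []
    for line in table.splitlines():
        fields = line.split()
        # Fields: address, ifindex, prefix length, scope, flags, device name
        if len(fields) == 6 and fields[0].lower() == wanted:
            hits.append(fields[5])
    return hits


# Sample line copied from the host that reported the error.
sample = "fd191b70f7a61ae500000000008d0006 03 40 00 80 eth1\n"
print(if_inet6_form("fd19:1b70:f7a6:1ae5::8d:6"))
print(find_in_if_inet6("fd19:1b70:f7a6:1ae5::8d:6", sample))
```

On a live host the second argument would be the contents of
/proc/net/if_inet6 read as a string.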

Then today the cluster tried to migrate some guests[2] and all hell
broke loose: xenstore could no longer talk to the block devices, and
drbdadm was unable to bring /any/ devices up or into the primary state,
again complaining about a missing IP. The output of
strace -fvy -s 512 drbdadm up mws-priv-18

is at http://www.chiark.greenend.org.uk/~matthewv/junk/drbdupstrace

[That strace was taken on the other Xen host, which has, in /proc/net/if_inet6:
fd191b70f7a61ae500000000008d0007 03 40 00 80     eth1
]

This is a pretty serious problem, and I could only resolve it by
rebooting both hosts in turn. Any ideas on how to debug it if it
happens again, resolve it without rebooting, or ideally stop it
happening again? 90 DRBD devices doesn't seem like it should be too many...

Regards,

Matthew

[0] top of /proc/drbd -
version: 8.4.5 (api:1/proto:86-101)
srcversion: 5A4F43804B37BB28FCB1F47

[1] I appreciate that this should be unnecessary, but it seemed to help
when I saw a similar issue in the past (see the thread titled "Misleading
error messages from drbdadm up (IP not found on this host)" from 27 Jan).

[2] example xen logging of failed migration - you can see xen having
problems with the backing store
http://www.chiark.greenend.org.uk/~matthewv/junk/migration-fail.txt