On Sun, Oct 05, 2014 at 04:32:04PM -0600, Devin Reade wrote:
> I have a pacemaker/corosync cluster on CentOS 5 that is running
> drbd 8.3.15. The drbd build is from the CentOS Extras repository.
>
> This two-node cluster has been in production for quite a while
> and it has been very stable. I recently applied updates from
> CentOS 5.10 to CentOS 5.11. Upon reboot, the cluster came up
> and drbd-overview showed that the three DRBD devices are properly
> sync'd. However while attempting to failover the three resource
> groups related to those three DRBD devices, I experienced a failure
> in one of them with the error:
>
> Filesystem[9402]: ERROR: Couldn't find device [/dev/drbd/by-res/web].
> Expected /dev/??? to exist
>
> (that would be output from the FileSystem pacemaker resource agent)
>
> Looking a bit further, I found that even though drbd-overview was
> reporting everything to be fine, I was not seeing all of the expected
> symlinks created in the /dev/drbd/by-res/ directory. Examination
> showed that across reboots I was getting either one or two of the
> expected three symlinks created in that directory. I also was not
> seeing anything identifiable in the other system logs regarding
> a problem, such as running out of resources.
>
> Eventually, I was able to get things back into a sane state by:
> 1. shutting down corosync on the problem node
> 2. doing a 'chkconfig corosync off'
> 3. rebooting the problem node
> 4. *without* starting corosync, doing a 'drbdadm up DEVICE' on
>    each entry in drbd.conf
> 5. on the working node, put the drbd master/slave sets into the
>    unmanaged state
> 6. start corosync on the problem node
> 7. bring the master/slave sets back into the managed state
> 8. chkconfig corosync on
>
> I've rebooted the problem node enough times now that I'm reasonably
> confident that whatever the cause of the problem was that it is no
> longer occurring, and I've successfully failed over all services to
> the formerly failing node.
>
> Despite being in a working state, I'd really prefer to know *why*
> this was happening. Under what circumstances could we expect DRBD
> to not create the /dev/drbd/by-res/ symlinks?

It's not DRBD that creates those, it is udev.

It may very well be "just" a timing issue.
(read: maybe you want to add some "sleep" somewhere ... )

If this udev magic turns out to be misbehaving or unreliable for you,
don't use it, but use the /dev/drbd[0-9] device nodes.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
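
[Editorial sketch, not part of the original message.] The "add some sleep" suggestion could also be done as a bounded wait: poll for the by-res symlinks until they appear rather than sleeping for a fixed time. The snippet below is only an illustration under stated assumptions. The function name wait_for_paths, the timeout, and the polling interval are all hypothetical; nothing in the thread says where such a wait would be hooked in; and it assumes a Python 3 interpreter, which a stock CentOS 5 install does not ship.

```python
import os
import time


def wait_for_paths(paths, timeout=30.0, interval=0.5):
    """Poll until every path in `paths` exists or `timeout` seconds elapse.

    Returns the list of paths that are still missing (empty on success).
    os.path.exists() follows symlinks, so a dangling symlink whose target
    device node is absent still counts as missing.
    """
    deadline = time.monotonic() + timeout
    missing = [p for p in paths if not os.path.exists(p)]
    while missing and time.monotonic() < deadline:
        time.sleep(interval)
        # Re-check only the paths that had not shown up yet.
        missing = [p for p in missing if not os.path.exists(p)]
    return missing
```

Called as wait_for_paths(["/dev/drbd/by-res/web"]), it returns an empty list once udev has created the link, or the list of still-missing paths after the timeout. In the latter case the caller could fall back to the /dev/drbd[0-9] device nodes, as suggested above.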