[DRBD-user] /dev/drbd/by-res symlinks not getting created

Mon Oct 6 17:19:18 CEST 2014

On Sun, Oct 05, 2014 at 04:32:04PM -0600, Devin Reade wrote:
> I have a pacemaker/corosync cluster on CentOS 5 that is running
> drbd 8.3.15.  The drbd build is from the CentOS Extras repository.
> 
> This two-node cluster has been in production for quite a while
> and it has been very stable.  I recently applied updates from
> CentOS 5.10 to CentOS 5.11.  Upon reboot, the cluster came up
> and drbd-overview showed that the three DRBD devices are properly
> sync'd.  However, while attempting to fail over the three resource
> groups related to those three DRBD devices, I experienced a failure
> in one of them with the error:
> 
> Filesystem[9402]: ERROR: Couldn't find device [/dev/drbd/by-res/web].
> Expected /dev/??? to exist
> 
> (that would be output from the Filesystem pacemaker resource agent)
> 
> Looking a bit further, I found that even though drbd-overview was
> reporting everything to be fine, I was not seeing all of the expected
> symlinks created in the /dev/drbd/by-res/ directory.  Examination
> showed that across reboots I was getting either one or two of the 
> expected three symlinks created in that directory.  I also was not
> seeing anything identifiable in the other system logs regarding
> a problem, such as running out of resources.
> 
> Eventually, I was able to get things back into a sane state by:
>  1. shutting down corosync on the problem node
>  2. doing a 'chkconfig corosync off'
>  3. rebooting the problem node
>  4. *without* starting corosync, doing a 'drbdadm up DEVICE' on
>     each entry in drbd.conf
>  5. on the working node, putting the drbd master/slave sets into the
>     unmanaged state
>  6. starting corosync on the problem node
>  7. bringing the master/slave sets back into the managed state
>  8. doing a 'chkconfig corosync on'
> 
> I've rebooted the problem node enough times now that I'm reasonably
> confident that whatever caused the problem is no longer occurring,
> and I've successfully failed over all services to the formerly
> failing node.
> 
> Despite being in a working state, I'd really prefer to know *why*
> this was happening.  Under what circumstances could we expect DRBD
> to not create the /dev/drbd/by-res/ symlinks? 

It's not DRBD that creates those, it is udev.
It may very well be "just" a timing issue.
(read: maybe you want to add some "sleep" somewhere ... )
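
If it is a timing issue, a bounded wait is usually safer than a fixed
"sleep".  The following is only a rough sketch of that idea, not anything
shipped with DRBD or the resource agents: the helper name wait_for_path
and the timeout values are made up for illustration, and the path is the
one from the error message above.

```python
import os
import time


def wait_for_path(path, timeout=10.0, interval=0.2):
    """Poll until `path` exists or `timeout` seconds have elapsed.

    os.path.exists() follows symlinks, so a dangling symlink counts as
    "not there yet".  Returns True if the path showed up, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example call (hypothetical placement, e.g. before mounting):
#   wait_for_path("/dev/drbd/by-res/web", timeout=10.0)
```

Whether such a wait belongs in a wrapper script or in a locally patched
resource agent depends on the setup; the point is only that the caller
gives udev a limited amount of time to catch up and then fails cleanly.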

If this udev magic turns out to be misbehaving or unreliable for you,
don't use it, but use the /dev/drbd[0-9] device nodes.
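
For scripts that still want to prefer the symlink when it is there, a
fallback to the plain device node could look roughly like this.  This is
a sketch under assumptions: pick_drbd_device is a hypothetical helper,
and the minor number has to come from the "device" line of the resource
in your own drbd.conf (0 below is just an example value).

```python
import os


def pick_drbd_device(resource, minor, dev_root="/dev"):
    """Return the by-res symlink if it exists, else the /dev/drbdN node.

    `resource` is the DRBD resource name (e.g. "web"); `minor` is the
    device minor number configured for that resource (an assumption the
    caller must supply).  `dev_root` is only a parameter to allow testing.
    """
    by_res = os.path.join(dev_root, "drbd", "by-res", resource)
    if os.path.exists(by_res):
        return by_res
    return os.path.join(dev_root, "drbd%d" % minor)


# Example call, assuming resource "web" is configured as /dev/drbd0:
#   pick_drbd_device("web", 0)
```

Configuring the cluster resource with the /dev/drbdN path directly, as
suggested above, avoids the dependency on udev altogether and is the
simpler option.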

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed