[DRBD-user] [Fwd: Re: [Linux-HA] Cannot create group containing drbd using HB GUI]

Wed Apr 25 15:51:47 CEST 2007

Hi Lars,
Turns out, I had to take down drbd on the secondary and completely
resync the partitions before I could get them to see each other. Not
sure how I got into that situation though. So, I now have a drbd
master/slave with a filesystem resource following it around via
heartbeat. I have noticed one unusual aspect, which I wanted to ask you
about. Right now, the filesystem resource mounts fine on one node but
not the other due to an earlier error in the switching. I believe I can
correct the situation, but I noticed that the drbd_promote function
first checks "are we already primary (local)?" and if so returns success,
then checks "are we not secondary (local)?" and if so returns a generic
error. It gets through to the drbdadm primary command only if we are
truly secondary. However, there is no check to make sure the remote is
not still primary. What I'm seeing is a promote call where the secondary
node tries to go primary while the primary node is still primary (not
sure if it just never got the secondary command or if it's a timing
issue and it just hasn't finished switching from primary to secondary).
This triggers an error: "State change failed: (-1) Multiple primaries
not allowed by config". The return from the drbdadm command is 0, even
though the return from drbdsetup, called within drbdadm, is 11. This 0
propagates back out to drbd_promote and gets returned to heartbeat as
success, so heartbeat assumes it's OK to proceed with the Filesystem
mount (which of course fails). See error snippet below:

drbd[9059]:     2007/04/25_08:44:22 DEBUG: pgsql: Calling /sbin/drbdadm -c /etc/drbd.conf primary pgsql
drbd[9059]:     2007/04/25_08:44:23 DEBUG: pgsql: Exit code 0
drbd[9059]:     2007/04/25_08:44:23 DEBUG: pgsql: Command output: State change failed: (-1) Multiple primaries not allowed by config
Command '/sbin/drbdsetup /dev/drbd0 primary' terminated with exit code 11
lrmd[1171]: 2007/04/25_08:44:23 info: RA output: (rsc_drbd_7788:0:promote:stdout) State change failed: (-1) Multiple primaries not allowed by config Command '/sbin/drbdsetup /dev/drbd0 primary' terminated with exit code 11
crmd[1176]: 2007/04/25_08:44:23 info: process_lrm_event: LRM operation rsc_drbd_7788:0_promote_0 (call=50, rc=0) complete
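
Since the exit code cannot be trusted here, one possible workaround on the
script side would be to re-read the local state after the promote attempt
and report success only if we really are Primary. A rough sketch, not the
actual OCF script: the function names local_role and promote_and_verify and
the DRBDADM override variable are made up for illustration, and it assumes
"drbdadm state <resource>" prints the roles as "<local>/<remote>".

```shell
# Hypothetical helper: print the local role for resource $1.
# Assumption: "drbdadm state <res>" prints "<local>/<remote>",
# e.g. "Secondary/Primary".
local_role() {
    ${DRBDADM:-drbdadm} state "$1" 2>/dev/null | sed -e 's#/.*##'
}

# Hypothetical wrapper: attempt the promote, then verify the result by
# re-reading the local role instead of trusting the exit code (which was 0
# in the log above even though the state change failed).
# Returns 0 only if the local role is Primary afterwards, 1 otherwise.
promote_and_verify() {
    ${DRBDADM:-drbdadm} primary "$1"
    if [ "$(local_role "$1")" = "Primary" ]; then
        return 0
    fi
    return 1
}
```

With something like this, the failed promote in the log would come back as
a non-zero return, so heartbeat would not go on to the Filesystem mount.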

What I propose to do is to put a conditional wait loop in the promote
function that checks the remote status: if it's Primary, sleep and loop
until the remote is no longer Primary, or give up after a fixed number
of loops. I think this provides a safer promote, though it doesn't
address the issue shown by the log messages above. Thoughts?
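
Roughly, the wait loop could look like the sketch below. This is only an
outline under assumptions: the names remote_role and wait_remote_not_primary,
the DRBDADM override variable, and the default of 10 attempts one second
apart are all made up for illustration, and it assumes
"drbdadm state <resource>" prints the roles as "<local>/<remote>".

```shell
# Hypothetical helper: print the remote role for resource $1.
# Assumption: "drbdadm state <res>" prints "<local>/<remote>",
# e.g. "Secondary/Primary".
remote_role() {
    ${DRBDADM:-drbdadm} state "$1" 2>/dev/null | sed -e 's#.*/##'
}

# Wait until the remote node is no longer Primary.
#   $1 = resource name
#   $2 = maximum number of checks (default 10, arbitrary)
#   $3 = seconds to sleep between checks (default 1, arbitrary)
# Returns 0 as soon as the remote role is anything other than "Primary",
# 1 if it is still Primary after the last check.
# Note: empty output (e.g. drbdadm failing) also counts as "not Primary",
# which may be too lenient for real use.
wait_remote_not_primary() {
    wrnp_res=$1
    wrnp_tries=${2:-10}
    wrnp_delay=${3:-1}
    while [ "$wrnp_tries" -gt 0 ]; do
        if [ "$(remote_role "$wrnp_res")" != "Primary" ]; then
            return 0
        fi
        wrnp_tries=$((wrnp_tries - 1))
        if [ "$wrnp_tries" -gt 0 ]; then
            sleep "$wrnp_delay"
        fi
    done
    return 1
}
```

The promote function would call this just before the drbdadm primary
command and return an error if it times out.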

Doug

On Tue, 2007-04-24 at 12:32 -0400, Doug Knight wrote:

> Hi Lars,
> I think I'm pretty close on getting the filesystem following drbd, but
> I've encountered a strange situation. drbd is running on both nodes,
> one indicating Secondary from the /proc/drbd file and Slave in
> heartbeat, the other Master and Primary. However, both show the other
> as Unknown. I'm able to ping successfully over the dedicated CAT6
> cable between the nodes. Any idea why the two drbd processes wouldn't
> see each other?
> 
> Doug
> On Tue, 2007-04-24 at 08:00 -0400, Doug Knight wrote:
> 
> > On Tue, 2007-04-24 at 11:07 +0200, Lars Ellenberg wrote:  
> > 
> > > On Mon, Apr 23, 2007 at 02:39:16PM -0400, Doug Knight wrote:
> > > > Hey Lars,
> > > 
> > > please read the last line of my sig... :)
> > > 
> > 
> > Sorry, I was jumping between email lists, and some of the others
> > come in with a reply-to already set to the list, not the
> > individual...
> > 
> > 
> > > > Quick question: Comparing the printf text in your patch to the various
> > > > checks in the drbd ocf script, should the patch be printf'ing "Not
> > > > configured" as the script is checking for (DRBD_STATE_LOCAL checks)? If
> > > > these should match, then I'd actually prefer to change the checks in the
> > > > ocf script to match your patch (I like not having the embedded blank). 
> > > > 
> > > > Doug
> > > 
> > > I think the 0.7 drbdsetup reported "Not configured".
> > > So maybe we should stick with that?
> > > 
> > > but since scripts tailored for 0.7 are likely to be broken for 8.0
> > > anyways, we could also change this to "Unconfigured".
> > > I'm not sure yet, but I'm more for consistency with the /proc/drbd
> > > output, so it's probably going to be "Unconfigured".
> > > 
> > > 
> > 
> > Good, I have already changed my script to look for Unconfigured, and
> > now I can switch drbd back and forth between nodes using a Place
> > constraint with no problem. Now, I'm working on tying the filesystem
> > resource to it. I've got a couple of questions up on the Linux-HA
> > list on that. There is so much overlap between the HA and the drbd
> > work that sometimes it's hard to tell which list I should send to (or
> > just to keep track of where I've asked what ;)
> > 
> > Thanks!  
> > 
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
> 