[DRBD-user] dopd failover / undo outdate

Fri Dec 7 18:31:27 CET 2007

OK . . . backtracking a little here. I would still like to know where I
can get some dopd docs, but apparently “drbdadm outdate all” doesn’t
work unless the nodes are disconnected. I tried disconnecting first and
then it worked.
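
For anyone repeating this, the order of operations above can be sketched
roughly as follows. This is only a sketch of what worked on my setup, not
a documented procedure. The function name outdate_after_disconnect and
the DRBDADM/RESOURCE variables are made up for illustration; the only
real commands used are "drbdadm disconnect" and "drbdadm outdate".

```shell
#!/bin/sh
# Sketch: outdate local resources only after dropping the replication link.
# DRBDADM and RESOURCE are illustrative knobs (not part of DRBD); setting
# DRBDADM=echo prints the sequence instead of running it.
DRBDADM="${DRBDADM:-drbdadm}"
RESOURCE="${RESOURCE:-all}"

outdate_after_disconnect() {
    # Step 1: tear down the connection to the peer.
    "$DRBDADM" disconnect "$RESOURCE" || return 1
    # Step 2: with the peer disconnected, mark the local data as Outdated.
    "$DRBDADM" outdate "$RESOURCE" || return 1
}
```

Calling it as "DRBDADM=echo outdate_after_disconnect" just prints the two
commands, which is a safe way to check the sequence before running it for
real on a test node. Afterwards /proc/drbd should show the new disk state.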

Upon closer inspection of the dopd syslog error, it appears that dopd is
looking for drbdadm in /sbin, while my distro has it in /usr/sbin. I tried
both copying and symlinking (ln -s) drbdadm, drbdsetup and drbdmeta into
/sbin, but neither works. The log now shows
unknown exit code from /sbin/drbdadm outdate all: 126
instead of
unknown exit code from /sbin/drbdadm outdate all: 127
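
For reference, in a POSIX shell an exit status of 127 conventionally means
the command was not found, and 126 means it was found but could not be
executed. So the change from 127 to 126 suggests the path is now resolved
but the file is not executable for whoever runs it. A rough sketch for
checking a helper path is below; check_helper is a hypothetical name, not
part of DRBD or heartbeat, and I am assuming dopd reports the shell's
status unchanged.

```shell
#!/bin/sh
# Sketch: predict which shell exit status a helper path would produce.
# check_helper is a made-up helper for illustration only.
check_helper() {
    path="$1"
    if [ ! -e "$path" ]; then
        # Missing file or dangling symlink: a shell reports 127.
        echo "$path: missing (shell status 127, command not found)"
        return 127
    elif [ -d "$path" ] || [ ! -x "$path" ]; then
        # File exists but the current user cannot execute it: 126.
        echo "$path: present but not executable (shell status 126)"
        return 126
    fi
    echo "$path: present and executable"
    return 0
}
```

Running "check_helper /sbin/drbdadm" (and the same for drbdsetup and
drbdmeta) shows which case applies. The result depends on the user running
the check, and the ha.cf quoted below respawns dopd as hacluster, so it is
probably worth running the check as that user rather than as root.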

Any clues as to where the path problem is?
Thanx

On Thu, 2007-12-06 at 09:28 -0800, Rois Cannon wrote:
> After trying Dominik's suggestion of running the drbd-peer-outdater
> manually, I've come to believe that dopd is NOT working on my machine.
> 
> Just for giggles I took out the dopd stuff
> drbd.conf: 
>    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater"
> ha.cf:
>    respawn hacluster /usr/lib/heartbeat/dopd
>    apiauth dopd gid=haclient uid=hacluster
> but left
> drbd.conf:
>    fencing    resource-only
> 
> and ran the same test where I pull the plug on node1.  Node2 became
> outdated and so heartbeat would not bring it up.  This suggests to me
> that the dopd stuff wasn't doing anything.
> 
> On node2 (when everything was uptodate and running) I then tried running
> what the log says drbd-peer-outdater is doing:
> drbdadm outdate all
> 
> This command gave no errors but did not outdate the resources on node2.
> I was able to restart heartbeat on node1, and node2 took over the
> resources as I would expect if drbd was uptodate (as it was still marked
> after I tried outdating it manually).
> 
> Florian (or others), do you have any ideas on what my problem is with
> drbdadm outdate all?  I could use some suggestions on how to debug this.
> 
> Thanx
> Rois
> 
> On Thu, 2007-12-06 at 08:38 -0800, Rois Cannon wrote:
> > See below. Thoughts??? -Rois
> > 
> > On Thu, 2007-12-06 at 10:38 +0100, Dominik Klein wrote:
> > > > I think I'm starting to get my mind around why and how dopd works, but
> > > > currently when the primary node goes down (pull the plug) the secondary
> > > > node is getting the message to outdate the drbd resource.  Here is where
> > > > I think it is happening on node2 after I pull the plug on node1:
> > > > -------------------------------------------------------------------
> > > > Dec  4 13:22:13 svr92 kernel: drbd0: helper command: /sbin/drbdadm
> > > > outdate-peer
> > > > Dec  4 13:22:13 svr92 kernel: drbd0: disk( UpToDate -> Outdated ) 
> > > > Dec  4 13:22:13 svr92 kernel: drbd0: outdate-peer helper broken,
> > > > returned 255 
> > > > Dec  4 13:22:13 svr92 kernel: drbd0: State change failed: Refusing to be
> > > > Primary without at least one UpToDate disk
> > > > -------------------------------------------------------------------
> > > > I can't find any reference to what "returned 255" means but the
> > > > outdate-peer appears to be broken???
> > > 
> > > I see this as well right now (as you may have noticed from my thread 
> > > about resource fencing). See if you can start drbd-peer-outdater -p 
> > > <peername> -r <resourcename> by hand (while heartbeat and dopd are running).
> > > 
> > > On my test system, this segfaults and returns 255, which might indicate 
> > > that it segfaults too when it is run by drbd (or heartbeat?).
> > > 
> > 
> > When I run 
> > /usr/lib/heartbeat/drbd-peer-outdater -p svr92 -r all
> > from the primary node (node1) I get no errors on node1 but I get this
> > message in the log on node2 (svr92 is node2):
> > -------------------------------------------------------------------------
> > Dec  6 08:21:39 svr92 /usr/lib/heartbeat/dopd: [4749]: info: msg_start_outdate: unknown exit code from /sbin/drbdadm outdate all: 127
> > Dec  6 08:21:39 svr92 /usr/lib/heartbeat/dopd: [4749]: info: msg_start_outdate: sending return code: 5, svr92 -> svr91
> > -------------------------------------------------------------------------
> > 
> > Same thing in the opposite direction.
> > 
> > > > So . . . node1 goes down and somehow in the process outdates node2's
> > > > resource so heartbeat can't bring it up.  There goes my redundancy BUT
> > > > if node1 really is dead, how can I undo the outdate flag on node2 so I
> > > > can bring that node up as the primary until I can fix node1?
> > > 
> > > You can always do "drbdadm -- --overwrite-data-of-peer primary 
> > > <resourcename>". This will bring your outdated resource into primary 
> > > state. But you don't want that to happen automatically.
> > > 
> > 
> > No joy on overwrite-data-of-peer but I did get it to come up when I ran:
> > drbdsetup all primary -o
> > on node2 (while node1 was still off) and that brought it back to life as
> > uptodate.
> > 
> > 
> > > Oh and btw: This is also true for the test-version 2.1.3 of heartbeat 
> > > which was announced for testing yesterday and is scheduled for Dec 
> > > 19th. It would be nice if it got fixed for 2.1.3.
> > > 
> > > Regards
> > > Dominik
> > > _______________________________________________
> > > drbd-user mailing list
> > > drbd-user at lists.linbit.com
> > > http://lists.linbit.com/mailman/listinfo/drbd-user
> > 