[DRBD-user] dopd failover / undo outdate

Thu Dec 6 17:38:40 CET 2007

See below. Thoughts??? -Rois

On Thu, 2007-12-06 at 10:38 +0100, Dominik Klein wrote:
> > I think I'm starting to get my mind around why and how dopd works, but
> > currently when the primary node goes down (pull the plug) the secondary
> > node is getting the message to outdate the drbd resource.  Here is where
> > I think it is happening on node2 after I pull the plug on node1:
> > -------------------------------------------------------------------
> > Dec  4 13:22:13 svr92 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
> > Dec  4 13:22:13 svr92 kernel: drbd0: disk( UpToDate -> Outdated )
> > Dec  4 13:22:13 svr92 kernel: drbd0: outdate-peer helper broken, returned 255
> > Dec  4 13:22:13 svr92 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
> > -------------------------------------------------------------------
> > I can't find any reference to what "returned 255" means but the
> > outdate-peer appears to be broken???
> 
> I see this as well right now (as you may have noticed from my thread 
> about resource fencing). See if you can start drbd-peer-outdater -p 
> <peername> -r <resourcename> by hand (while heartbeat and dopd are running).
> 
> On my testsystem, this segfaults and returns 255, which might indicate 
> that it segfaults too when it is run by drbd (or heartbeat?).
> 

When I run 
/usr/lib/heartbeat/drbd-peer-outdater -p svr92 -r all
from the primary node (node1) I get no errors on node1 but I get this
message in the log on node2 (svr92 is node2):
-------------------------------------------------------------------------
Dec  6 08:21:39 svr92 /usr/lib/heartbeat/dopd: [4749]: info: msg_start_outdate: unknown exit code from /sbin/drbdadm outdate all: 127
Dec  6 08:21:39 svr92 /usr/lib/heartbeat/dopd: [4749]: info: msg_start_outdate: sending return code: 5, svr92 -> svr91
-------------------------------------------------------------------------

Same thing in the opposite direction.
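
A note on the 127 in the log above. This is an assumption based on general shell behaviour, not something confirmed from the dopd source: 127 is the status a POSIX shell reports when it cannot find the command it was asked to run, so it may be worth checking that /sbin/drbdadm exists and is executable at that path on both nodes. A minimal sketch of the convention follows; classify_status and the bogus command name are made up for illustration and are not part of drbd or heartbeat.

```shell
#!/bin/sh
# Hypothetical helper: map an exit status to the usual POSIX shell meaning.
classify_status() {
    case "$1" in
        0)   echo "success" ;;
        126) echo "found but not executable" ;;
        127) echo "command not found" ;;
        *)   echo "command-specific status $1" ;;
    esac
}

# Ask a shell to run a command that does not exist (the name is invented)
# and capture the status the shell hands back.
sh -c 'no-such-command-for-illustration' 2>/dev/null
status=$?

echo "exit status: $status"
classify_status "$status"
```

If the same status shows up when dopd runs drbdadm, a plain `ls -l /sbin/drbdadm` on each node would be a cheap first check.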

> > So . . . node1 goes down and somehow in the process outdates node2's
> > resource so heartbeat can't bring it up.  There goes my redundancy BUT
> > if node1 really is dead, how can I undo the outdate flag on node2 so I
> > can bring that node up as the primary until I can fix node1?
> 
> You can always do "drbdadm -- --overwrite-data-of-peer primary 
> <resourcename>". This will bring your outdated resource into primary 
> state. But you don't want that to happen automatically.
> 

No joy with --overwrite-data-of-peer, but I did get it to come up when I ran:
drbdsetup all primary -o
on node2 (while node1 was still off), and that brought the resource back to
life as UpToDate.
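
Before forcing a node to primary like this, it seems sensible to confirm that the local disk really is the one marked Outdated. A rough sketch, assuming the /proc/drbd status line carries a ds:<local>/<peer> field; the sample line is hypothetical (not copied from my nodes) and local_disk_state is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: read one /proc/drbd status line on stdin and print
# the local disk state, i.e. the part of the ds: field before the slash.
local_disk_state() {
    sed -n 's|.* ds:\([A-Za-z]*\)/.*|\1|p'
}

# Hypothetical sample line, assumed to follow the cs:/st:/ds: layout.
sample=' 0: cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown C r---'

state=$(printf '%s\n' "$sample" | local_disk_state)
echo "local disk state: $state"

# Only point at the manual override when the disk is actually Outdated.
if [ "$state" = "Outdated" ]; then
    echo "resource is Outdated; forcing primary is a deliberate manual step"
fi
```

On a real node the input would come from /proc/drbd instead of the sample variable.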

> Oh and btw: This is also true for the test-version 2.1.3 of heartbeat 
> which was announced for testing yesterday and is scheduled for Dec 
> 19th. It would be nice if it got fixed for 2.1.3.
> 
> Regards
> Dominik