[DRBD-user] dopd failover / undo outdate
Rois Cannon
rois at cobiz.com
Thu Dec 6 17:38:40 CET 2007
See below. Thoughts??? -Rois
On Thu, 2007-12-06 at 10:38 +0100, Dominik Klein wrote:
> > I think I'm starting to get my mind around why and how dopd works, but
> > currently when the primary node goes down (pull the plug) the secondary
> > node is getting the message to outdate the drbd resource. Here is where
> > I think it is happening on node2 after I pull the plug on node1:
> > -------------------------------------------------------------------
> > Dec 4 13:22:13 svr92 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
> > Dec 4 13:22:13 svr92 kernel: drbd0: disk( UpToDate -> Outdated )
> > Dec 4 13:22:13 svr92 kernel: drbd0: outdate-peer helper broken, returned 255
> > Dec 4 13:22:13 svr92 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
> > -------------------------------------------------------------------
> > I can't find any reference to what "returned 255" means but the
> > outdate-peer appears to be broken???
>
> I see this as well right now (as you may have noticed from my thread
> about resource fencing). See if you can start drbd-peer-outdater -p
> <peername> -r <resourcename> by hand (while heartbeat and dopd are running).
>
> On my testsystem, this segfaults and returns 255, which might indicate
> that it also segfaults when it is run by drbd (or heartbeat?).
>
When I run
/usr/lib/heartbeat/drbd-peer-outdater -p svr92 -r all
from the primary node (node1) I get no errors on node1 but I get this
message in the log on node2 (svr92 is node2):
-------------------------------------------------------------------------
Dec 6 08:21:39 svr92 /usr/lib/heartbeat/dopd: [4749]: info: msg_start_outdate: unknown exit code from /sbin/drbdadm outdate all: 127
Dec 6 08:21:39 svr92 /usr/lib/heartbeat/dopd: [4749]: info: msg_start_outdate: sending return code: 5, svr92 -> svr91
-------------------------------------------------------------------------
Same thing in the opposite direction.
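
For what it's worth, by shell convention an exit status of 127 usually means the command could not be found or started, so the 127 above may mean dopd could not run /sbin/drbdadm at all, rather than drbdadm refusing the outdate. That is a guess from the number alone, not something the log confirms. A minimal sketch for decoding a status when running the command by hand (describe_exit is a made-up helper, not part of DRBD or heartbeat):

```shell
# Hypothetical helper: describe_exit is NOT part of DRBD or heartbeat.
# It maps an exit status to the usual POSIX shell convention.
describe_exit() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -eq 126 ]; then
        echo "found but not executable"
    elif [ "$code" -eq 127 ]; then
        echo "command not found"
    elif [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        # Shells report death by signal N as 128 + N.
        echo "killed by signal $((code - 128))"
    else
        echo "command-specific exit code $code"
    fi
}

# Usage (run by hand on the node that logged the 127):
#   /sbin/drbdadm outdate all; describe_exit $?
```

If the 127 reproduces when the command is run by hand, the problem is likely in how drbdadm is found or started on that node. A process that segfaults under a shell would normally be reported as 139 (128 + 11), not 255, so the 255 from the kernel helper may be a separate issue.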
> > So . . . node1 goes down and somehow in the process outdates node2's
> > resource so heartbeat can't bring it up. There goes my redundancy BUT
> > if node1 really is dead, how can I undo the outdate flag on node2 so I
> > can bring that node up as the primary until I can fix node1?
>
> You can always do "drbdadm -- --overwrite-data-of-peer primary
> <resourcename>". This will bring your outdated resource into primary
> state. But you don't want that to happen automatically.
>
No joy on overwrite-data-of-peer but I did get it to come up when I ran:
drbdsetup all primary -o
on node2 (while node1 was still off), and that brought it back to life as
UpToDate.
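
Before forcing a node to primary like this, it may be worth confirming what DRBD thinks the local disk state is. A small sketch, assuming /proc/drbd reports each device on a line containing a ds:<local>/<peer> field (that layout is an assumption about this DRBD version, and local_disk_state is a made-up helper name):

```shell
# Hypothetical helper, not shipped with DRBD.
# ASSUMPTION: each device line in /proc/drbd looks like
#   "<minor>: ... ds:<LocalState>/<PeerState> ..."
# $1 = device minor number; stdin = contents of /proc/drbd.
local_disk_state() {
    sed -n "s|^ *$1: .*ds:\([A-Za-z]*\)/.*|\1|p"
}

# Usage (on the surviving node):
#   local_disk_state 0 < /proc/drbd
```

If this prints Outdated before the forced promotion and UpToDate afterwards, the override did what was intended. It prints nothing if the line format differs from the assumption.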
> Oh and btw: This is also true for the test-version 2.1.3 of heartbeat
> which was announced for testing yesterday and is scheduled for Dec
> 19th. It would be nice if it got fixed for 2.1.3.
>
> Regards
> Dominik
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user