See below. Thoughts???

-Rois

On Thu, 2007-12-06 at 10:38 +0100, Dominik Klein wrote:
> > I think I'm starting to get my mind around why and how dopd but
> > currently when the primary node goes down (pull the plug) the secondary
> > node is getting the message to outdate the drbd resource. Here is where
> > I think it is happening on node2 after I pull the plug on node1:
> > -------------------------------------------------------------------
> > Dec 4 13:22:13 svr92 kernel: drbd0: helper command: /sbin/drbdadm
> > outdate-peer
> > Dec 4 13:22:13 svr92 kernel: drbd0: disk( UpToDate -> Outdated )
> > Dec 4 13:22:13 svr92 kernel: drbd0: outdate-peer helper broken,
> > returned 255
> > Dec 4 13:22:13 svr92 kernel: drbd0: State change failed: Refusing to be
> > Primary without at least one UpToDate disk
> > -------------------------------------------------------------------
> > I can't find any reference to what "returned 255" means but the
> > outdate-peer appears to be broken???
>
> I see this as well right now (as you may have noticed from my thread
> about resource fencing). See if you can start drbd-peer-outdater -p
> <peername> -r <resourcename> by hand (while heartbeat and dopd are running).
>
> On my testsystem, this segfaults and returns 255 which might indicate
> that when it is run by drbd (or heartbeat?), segfaults too.
>

When I run /usr/lib/heartbeat/drbd-peer-outdater -p svr92 -r all from
the primary node (node1) I get no errors on node1 but I get this message
in the log on node2 (svr92 is node2):
-------------------------------------------------------------------------
Dec 6 08:21:39 svr92 /usr/lib/heartbeat/dopd: [4749]: info: msg_start_outdate: unknown exit code from /sbin/drbdadm outdate all: 127
Dec 6 08:21:39 svr92 /usr/lib/heartbeat/dopd: [4749]: info: msg_start_outdate: sending return code: 5, svr92 -> svr91
-------------------------------------------------------------------------
Same thing in the opposite direction.

> > So . . . node1 goes down and somehow in the process outdates node2's
> > resource so heartbeat can't bring it up. There goes my redundancy BUT
> > if node1 really is dead, how can I undo the outdate flag on node2 so I
> > can bring that node up as the primary until I can fix node1?
>
> You can always do "drbdadm -- --overwrite-data-of-peer primary
> <resourcename>". This will bring your outdated resource into primary
> state. But you don't want that to happen automatically.
>

No joy on overwrite-data-of-peer but I did get it to come up when I ran:

drbdsetup all primary -o

on node2 (while node1 was still off) and that brought it back to life as
uptodate.

> Oh and btw: This is also true for the test-version 2.1.3 of heartbeat
> which was announced for testing yesterday and is scheduled for Dec,
> 19th. It would be nice if it would get fixed for 2.1.3.
>
> Regards
> Dominik
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
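
A note on the two exit codes quoted in this thread (255 from the outdate-peer helper, 127 from "/sbin/drbdadm outdate all"): the sketch below may help when reading them. It is not taken from DRBD or heartbeat. It only encodes the usual POSIX shell conventions, and describe_exit_status is a hypothetical helper name. Whether dopd and the kernel helper report statuses that follow these conventions is an assumption.

```python
import signal


def describe_exit_status(code: int) -> str:
    """Describe an exit status using common POSIX shell conventions.

    Assumption: the status was produced by (or passed through) a shell.
    Programs are free to use these numbers for their own purposes.
    """
    if code == 0:
        return "success"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found"
    if 128 < code < 255:
        # Shells report death by signal N as 128 + N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 255:
        # For example exit(-1), which wraps around to 255.
        return "exit status out of range"
    return "application-specific failure"


if __name__ == "__main__":
    # The two codes that appear in the logs above.
    for status in (255, 127):
        print(status, "->", describe_exit_status(status))
```

Read this way, 127 would suggest that the drbdadm call could not be found or executed in the environment dopd runs it in, which may be worth ruling out before suspecting DRBD itself.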