Lars,

That bitmap adjustment did the trick; resource-only fencing is working
now. I'll investigate the new metadata tools; the less of this stuff we
have to maintain, the better. Thanks!

Peter

> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-
> bounces at lists.linbit.com] On Behalf Of Lars Ellenberg
> Sent: Saturday, February 06, 2010 1:12 PM
> To: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] drbd fencing policy problem, upgrading from
> 8.2.7 -> 8.3.6
>
> On Fri, Feb 05, 2010 at 12:46:24PM -0500, Petrakis, Peter wrote:
> > Hi All,
> >
> > We currently use resource-only fencing, which is causing us a problem
> > when we bring up a cluster for the first time. This all worked
> > well with 8.2.7. We only have one node active at this time, and the
> > fencing handler is bailing with '5', which is correct. The problem is
> > that this return failure is stopping us from becoming Primary.
> >
> > block drbd5: helper command: /usr/lib/spine/bin/avance_drbd_helper
> > fence-peer minor-5 exit code 5 (0x500)
> > block drbd5: State change failed: Refusing to be Primary without at
> > least one UpToDate disk
> > block drbd5: state = { cs:StandAlone ro:Secondary/Unknown
> > ds:Consistent/DUnknown r--- }
> > block drbd5: wanted = { cs:StandAlone ro:Primary/Unknown
> > ds:Consistent/DUnknown r--- }
>
> I'm quoting our internal bugzilla here:
>
> scenario:
> A (Primary) --- b (Secondary)
> A (Primary)     b (down, crashed; does not know that it is out of date)
> a (down)
>   clean shutdown or crashed, does not matter; knows b is Outdated...
>
> a (still down)  b (reboot)
>   after init dead, heartbeat tries to promote
>
> current behaviour:
> b (Secondary, Consistent, pdsk DUnknown)
>   calls dopd, times out, considers the other node as outdated,
>   and goes Primary.
>
> proposal:
> only try to outdate the peer if local data is UpToDate.
> - works for primary crash (secondary data is UpToDate)
> - works for secondary crash (naturally)
> - works for primary reboot while secondary is offline,
>   because pdsk is marked Outdated in local meta data,
>   so we can become UpToDate on restart.
> - additionally works for the previously described scenario,
>   because the crashed secondary has pdsk Unknown,
>   and thus comes up as Consistent (not UpToDate)
>
> avoids one more way to create diverging data sets.
>
> if you want to force Consistent to be UpToDate,
> you'd then need to either fiddle with drbdmeta,
> or maybe just the good old "--overwrite-data-of-peer" does the trick?
>
> does not work for a cluster crash
> when only the former secondary comes up.
>
> ------- Comment #1 From Florian Haas 2009-07-03 09:07:57 -------
>
> Bumping severity to "normal" and setting target milestone to 8.3.3.
>
> This is not an enhancement feature, it's a real (however subtle) bug.
> This tends to break dual-Primary setups with fencing, such as with GFS:
>
> - Dual-Primary configuration, GFS and CMAN running.
> - Replication link goes away.
> - Both nodes are now "degraded clusters".
> - GFS/CMAN initiates fencing; one node gets rebooted.
> - Node reboots, link is still down. Since we were "degraded clusters"
>   to begin with, degr-wfc-timeout now applies, which is finite by
>   default.
> - After the timeout expires, the recovered node is now Consistent and
>   attempts to fence the peer, when it should not.
> - Since the network link is still down, fencing fails, but we now
>   assume the peer is dead, and the node becomes Primary anyway.
> - We have split brain, diverging datasets, and all our fencing
>   precautions are moot.
>
> ------- Comment #3 From Philipp Reisner 2009-08-25 14:42:58 -------
>
> Proposed solution:
>
> Just to recap, the expected exit codes of the fence-peer handler:
>
> 3  Peer's disk state was already Inconsistent.
> 4  Peer's disk state was successfully set to Outdated (or was Outdated
>    to begin with).
> 5  Connection to the peer node failed; peer could not be reached.
> 6  Peer refused to be outdated because the affected resource was in the
>    primary role.
> 7  Peer node was successfully fenced off the cluster. This should never
>    occur unless fencing is set to resource-and-stonith for the affected
>    resource.
>
> Now, if we get a 5 (peer not reachable) and we are not UpToDate (that
> means we are only Consistent), then refuse to become Primary and do not
> consider the peer as outdated.
>
> The change in DRBD is minimal, but we then need to implement exit code
> 5 in crm-fence-peer.sh as well.
>
> ------- Comment #4 From Philipp Reisner 2009-08-26 19:15:33 -------
>
> commit 3a26bafa2e27892c9e157720525a094b55748f09
> Author: Philipp Reisner <philipp.reisner at linbit.com>
> Date:   Tue Aug 25 14:49:19 2009 +0200
>
>     Do not consider peer dead when fence-peer handler can not reach it
>     and we are < UpToDate
>
> ------- Comment #5 From Philipp Reisner 2009-10-21 13:26:58 -------
>
> Released with 8.3.3
>
> > We set up our metadata in advance before we use drbdsetup to attach
> > the disk. Here's the metadata:
> >
> > version "v08";
> > uuid {
> >     0x0000000000000006; 0x0000000000000000; 0x0000000000000000;
> >     0x0000000000000000;
> >     flags 0x00000011;
>
> Right. I assume you do it this way to avoid the initial sync.
> There is no need for that anymore, we now have the
> "new-uuid --clear-bitmap" for that.
>
> Anyways, it should work for you again if you initialize the flags to
> also mark the peer as outdated, i.e. instead of 0x11, make that 0x31.
>
> > The flags we set ought to be telling DRBD that our data is UpToDate
> > enough to become Primary, but it doesn't seem to matter while fencing
> > is enabled. The result is the same regardless of whether I specify
> > '-o' while calling drbdsetup primary.
> >
> > This is how the disk is being set up:
> >
> > /sbin/drbdsetup /dev/drbd5 disk /dev/disk-drbd5 /dev/disk-drbd5
> > internal --set-defaults --create-device
> > --on-io-error=detach -f resource-only
> >
> > Thanks in advance.
> >
> > Peter
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD(r) and LINBIT(r) are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
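[Editor's note: for readers following the 0x11 -> 0x31 advice above, here is
a small sketch decoding the v08 metadata "flags" field. The bit names and
values are an assumption taken from the MDF_* constants in the DRBD 8.3
sources; verify them against the headers shipped with your DRBD version.]

```python
# Decode the "flags" word in DRBD v08 on-disk metadata.
# Bit values assumed from DRBD 8.3's MDF_* constants (drbd_int.h).
MDF_CONSISTENT     = 1 << 0  # 0x01: local data is consistent
MDF_PRIMARY_IND    = 1 << 1  # 0x02: node has been Primary
MDF_CONNECTED_IND  = 1 << 2  # 0x04: peer has been connected
MDF_FULL_SYNC      = 1 << 3  # 0x08: a full sync is required
MDF_WAS_UP_TO_DATE = 1 << 4  # 0x10: data was UpToDate
MDF_PEER_OUT_DATED = 1 << 5  # 0x20: peer is known to be Outdated

_NAMES = {
    MDF_CONSISTENT: "Consistent",
    MDF_PRIMARY_IND: "PrimaryInd",
    MDF_CONNECTED_IND: "ConnectedInd",
    MDF_FULL_SYNC: "FullSync",
    MDF_WAS_UP_TO_DATE: "WasUpToDate",
    MDF_PEER_OUT_DATED: "PeerOutDated",
}

def decode(flags):
    """Return the list of flag names set in a metadata flags word."""
    return [name for bit, name in _NAMES.items() if flags & bit]

print(decode(0x11))  # ['Consistent', 'WasUpToDate']
print(decode(0x31))  # ['Consistent', 'WasUpToDate', 'PeerOutDated']
```

Under these assumed bit values, the difference between 0x11 and 0x31 is
exactly the peer-Outdated bit (0x20), which is why the pre-seeded metadata
satisfies the stricter 8.3.3+ fencing check: with the peer recorded as
Outdated, a Consistent node may still be promoted.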