[DRBD-user] Recovering from erroneous sync state

Wed May 23 23:18:37 CEST 2012

On Wed, May 23, 2012 at 04:12:23PM -0500, Zev Weiss wrote:
> 
> On May 23, 2012, at 3:45 PM, Lars Ellenberg wrote:
> 
> > On Wed, May 23, 2012 at 03:34:27PM -0500, Zev Weiss wrote:
> >> 
> >> On May 23, 2012, at 3:22 PM, Florian Haas wrote:
> >> 
> >>> On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss at scout.wisc.edu> wrote:
> >>>> Hi,
> >>>> 
> >>>> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere).  Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
> >>>> 
> >>>> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions?  It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.
> >>> 
> >>> Can you post /proc/drbd contents from both nodes here?
> >>> 
> >> 
> >> Sure -- here's one node:
> >> 
> >> version: 8.3.12 (api:88/proto:86-96)
> >> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss at mydomain, 2012-03-14 19:52:38
> >> 
> >> <snip other resources>
> >> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
> >>    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
> >>        [>...................] sync'ed:  5.9% (65536/65536)K
> >>        finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
> >>          0% sector pos: 0/10698352
> >>        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
> >>        act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
> > 
> > drbdsetup 9 disconnect --force
> > may work,
> > if you did not try a non-forced disconnect or similar before,
> > that is to say, if the drbd worker thread is not blocked yet.
> > 
> 
> I think I had tried a non-forced disconnect previously (and perhaps
> also implicitly as part of a 'down' attempt, though I'm not sure
> whether it would have gotten to that step if the disconnect operation
> didn't complete), but 'drbdsetup 9 disconnect --force' also just
> hangs.
> 
> > You can always cut the tcp connection using iptables,
> > which should at least get the worker into a responsive state again.
> > 
> 
> As mentioned in another message in response to Florian, blocking the replication port via iptables doesn't seem to have had any effect.

You should not "block" as in DROP, but REJECT with a TCP reset.
DROP only discards the packets silently; a reset actively tears the
connection down.
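
A minimal sketch of such rules, under stated assumptions: port 7789 is
only a placeholder (use the port from the "address" lines of the affected
resource in your drbd.conf), and the run/reject_rules helpers are made up
for illustration. By default it only prints the iptables commands; set
DRY_RUN=0 and run as root to actually apply them.

```shell
#!/bin/sh
# Assumption: 7789 is a placeholder, not the port of the resource above.
PORT="${PORT:-7789}"
# DRY_RUN=1 (the default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: reject replication traffic with a TCP reset.
# Either end of a connection may own the configured port, so match it
# as both destination and source, for incoming and outgoing packets.
reject_rules() {
    for chain in INPUT OUTPUT; do
        for match in --dport --sport; do
            run iptables -I "$chain" -p tcp "$match" "$PORT" \
                -j REJECT --reject-with tcp-reset
        done
    done
}

reject_rules
```

Once the worker responds again, delete the rules by repeating the same
commands with -D in place of -I.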

You could also try "ifdown", sleep a while, and then "ifup" again
(which obviously will impact the other resources, and everything else
going via that interface).

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com