[DRBD-user] Recovering from erroneous sync state

Wed May 23 23:48:08 CEST 2012

On Wed, May 23, 2012 at 04:34:28PM -0500, Zev Weiss wrote:
> >>>>>> I'm running DRBD 8.3.12, and recently hit what looks to me like
> >>>>>> a bug that was listed as fixed in 8.3.13 -- getting into a
> >>>>>> state where both nodes are in SyncSource (it's just stuck like
> >>>>>> that, going nowhere).  Luckily this happened on a test resource
> >>>>>> and not a live one, so it's not a big problem, but I was
> >>>>>> wondering if there were any known ways of recovering it without
> >>>>>> doing anything disruptive to the other resources (e.g.
> >>>>>> rebooting or unloading the kernel module).
> >>>>>> 
> >>>>>> I've tried 'drbdadm down', but it just hangs -- anyone have any
> >>>>>> other suggestions?  It doesn't really matter to me if it wipes
> >>>>>> the resource or anything, I'd just like to have my test device
> >>>>>> back in a working state without disturbing anything else.
> >>>>> 
> >>>>> Can you post /proc/drbd contents from both nodes here?
> >>>>> 
> >>>> 
> >>>> Sure -- here's one node:
> >>>> 
> >>>> version: 8.3.12 (api:88/proto:86-96)
> >>>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss@mydomain, 2012-03-14 19:52:38
> >>>> 
> >>>> <snip other resources>
> >>>> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
> >>>>   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
> >>>>       [>...................] sync'ed:  5.9% (65536/65536)K
> >>>>       finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
> >>>>         0% sector pos: 0/10698352
> >>>>       resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
> >>>>       act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
> >>> 
> >>> drbdsetup 9 disconnect --force
> >>> may work,
> >>> if you did not try a non-forced disconnect or similar before,
> >>> that is to say, if the drbd worker thread is not blocked yet.
> >>> 
> >> 
> >> I think I had tried a non-forced disconnect previously (and perhaps
> >> also implicitly as part of a 'down' attempt, though I'm not sure
> >> whether it would have gotten to that step if the disconnect operation
> >> didn't complete), but 'drbdsetup 9 disconnect --force' also just
> >> hangs.
> >> 
> >>> You can always cut the tcp connection using iptables,
> >>> which should at least get the worker into a responsive state again.
> >>> 
> >> 
> >> As mentioned in another message in response to Florian, blocking
> >> the replication port via iptables doesn't seem to have had any
> >> effect.
> > 
> > You would not need to "block" as in DROP, but to REJECT with tcp reset.
> > 
> 
> Sorry, should have worded that more carefully -- it wasn't strictly a
> DROP, but a REJECT, with icmp-port-unreachable.  I've since tweaked it
> to reject with tcp-reset instead.  No changes in DRBD state on either
> side as far as I can see, though both nodes still seem to have (or at
> least think they have) a tcp connection or two involving that port,
> despite my efforts with iptables:
> 
> [root@node1 ~]# netstat -tn | fgrep 7789
> tcp      156      0 192.168.1.2:37324           192.168.1.1:7789            ESTABLISHED 
> tcp        0      0 192.168.1.2:7789            192.168.1.1:47548           ESTABLISHED 
> 
> [root@node2 ~]# netstat -tn | fgrep 7789
> tcp        0      0 192.168.1.1:47548           192.168.1.2:7789            ESTABLISHED 

Huh? Shouldn't there be two here as well?
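
To compare the two nodes without eyeballing the output, the connections can be counted. This is only a minimal sketch: count_established is a made-up helper name (not part of DRBD or net-tools), and it assumes the column layout of the `netstat -tn` output quoted above.

```shell
#!/bin/sh
# Count ESTABLISHED TCP connections that involve a given port, reading
# `netstat -tn` style output on stdin.
# count_established is a hypothetical helper, not part of DRBD or net-tools.
count_established() {
    port=$1
    awk -v port="$port" '
        $1 ~ /^tcp/ && $NF == "ESTABLISHED" {
            # Columns 4 and 5 hold the local and foreign address:port.
            n_l = split($4, l, ":")
            n_r = split($5, r, ":")
            if (l[n_l] == port || r[n_r] == port) count++
        }
        END { print count + 0 }
    '
}
```

On a live node this would be run as `netstat -tn | count_established 7789`; if the two nodes report different counts, as in the output quoted above, one side is presumably holding on to a half-dead connection.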

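Anyway:
port=7789
# Reject with a TCP reset in both chains, matching the port as either
# destination or source.
for chain in INPUT OUTPUT; do
    for direction in dport sport; do
        iptables -I "$chain" -p tcp "--$direction" "$port" -j REJECT --reject-with tcp-reset
    done
done

That should usually do it.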
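
Those rules stay in place until they are deleted again, and the resource cannot reconnect while they are there. Below is a minimal sketch of the cleanup, assuming the same port and the same four rules as above. reject_rule_cmds is a made-up helper name, and it only prints the commands so they can be reviewed before anything touches the firewall.

```shell
#!/bin/sh
# Print the iptables commands for the four REJECT rules above.
# reject_rule_cmds is a hypothetical helper; action is -I (insert) or -D (delete).
reject_rule_cmds() {
    action=$1
    port=$2
    for chain in INPUT OUTPUT; do
        for direction in dport sport; do
            printf 'iptables %s %s -p tcp --%s %s -j REJECT --reject-with tcp-reset\n' \
                "$action" "$chain" "$direction" "$port"
        done
    done
}

# Dry run: show the delete commands for port 7789.
reject_rule_cmds -D 7789
```

Once the output looks right, piping it to `sh` as root would apply it. Printing first is a deliberate choice here, since a typo in the port would otherwise cut replication for some other resource.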

> For what it's worth, node1 is the one that thinks the resource is
> Secondary/Secondary, node2 is the one that shows it as
> Secondary/Primary.
> 
> > You could also try "ifdown" sleep a while and ifup again.
> > (which obviously will impact the other resources, and everything going
> > via that interface).
> > 
> 
> Right, but I don't really want to disrupt replication on all the other
> (production) resources, so just leaving it as-is until my next
> scheduled maintenance reboot (at which point I plan on  would be
> preferable.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed