[DRBD-user] Recovering from erroneous sync state

Thu May 24 22:03:03 CEST 2012

On May 23, 2012, at 4:48 PM, Lars Ellenberg wrote:

> On Wed, May 23, 2012 at 04:34:28PM -0500, Zev Weiss wrote:
>>>>>>>> I'm running DRBD 8.3.12, and recently hit what looks to me like
>>>>>>>> a bug that was listed as fixed in 8.3.13 -- getting into a
>>>>>>>> state where both nodes are in SyncSource (it's just stuck like
>>>>>>>> that, going nowhere).  Luckily this happened on a test resource
>>>>>>>> and not a live one, so it's not a big problem, but I was
>>>>>>>> wondering if there were any known ways of recovering it without
>>>>>>>> doing anything disruptive to the other resources (e.g.
>>>>>>>> rebooting or unloading the kernel module).
>>>>>>>> 
>>>>>>>> I've tried 'drbdadm down', but it just hangs -- anyone have any
>>>>>>>> other suggestions?  It doesn't really matter to me if it wipes
>>>>>>>> the resource or anything, I'd just like to have my test device
>>>>>>>> back in a working state without disturbing anything else.
>>>>>>> 
>>>>>>> Can you post /proc/drbd contents from both nodes here?
>>>>>>> 
>>>>>> 
>>>>>> Sure -- here's one node:
>>>>>> 
>>>>>> version: 8.3.12 (api:88/proto:86-96)
>>>>>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss@mydomain, 2012-03-14 19:52:38
>>>>>> 
>>>>>> <snip other resources>
>>>>>> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
>>>>>>  ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>>>>>>      [>...................] sync'ed:  5.9% (65536/65536)K
>>>>>>      finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
>>>>>>        0% sector pos: 0/10698352
>>>>>>      resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>>>>      act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>>> 
>>>>> drbdsetup 9 disconnect --force
>>>>> may work,
>>>>> if you did not try a non-forced disconnect or similar before,
>>>>> that is to say, if the drbd worker thread is not blocked yet.
>>>>> 
>>>> 
>>>> I think I had tried a non-forced disconnect previously (and perhaps
>>>> also implicitly as part of a 'down' attempt, though I'm not sure
>>>> whether it would have gotten to that step if the disconnect operation
>>>> didn't complete), but 'drbdsetup 9 disconnect --force' also just
>>>> hangs.
>>>> 
>>>>> You can always cut the tcp connection using iptables,
>>>>> which should at least get the worker into a responsive state again.
>>>>> 
>>>> 
>>>> As mentioned in another message in response to Florian, blocking
>>>> the replication port via iptables doesn't seem to have had any
>>>> effect.
>>> 
>>> You would not need to "block" as in DROP, but to REJECT with tcp reset.
>>> 
>> 
>> Sorry, should have worded that more carefully -- it wasn't strictly a
>> DROP, but a REJECT, with icmp-port-unreachable.  I've since tweaked it
>> to reject with tcp-reset instead.  No changes in DRBD state on either
>> side as far as I can see, though both nodes still seem to have (or at
>> least think they have) a tcp connection or two involving that port,
>> despite my efforts with iptables:
>> 
>> [root@node1 ~]# netstat -tn | fgrep 7789
>> tcp      156      0 192.168.1.2:37324           192.168.1.1:7789            ESTABLISHED 
>> tcp        0      0 192.168.1.2:7789            192.168.1.1:47548           ESTABLISHED 
>> 
>> [root@node2 ~]# netstat -tn | fgrep 7789
>> tcp        0      0 192.168.1.1:47548           192.168.1.2:7789            ESTABLISHED 
> 
> Huh? Shouldn't there be two as well?
> 

One would think, yes...but that's all it shows.  (No idea why.)

> anyways:
> port=7789
> for chain in INPUT OUTPUT; do
>     for direction in dport sport; do
>         # match the port as destination and as source, in both chains
>         iptables -I "$chain" -p tcp "--$direction" "$port" -j REJECT --reject-with tcp-reset
>     done
> done
> 
> should do it usually.

Ah, I hadn't thought to reject outgoing packets as well. Doing so seems to have finally had the intended effect: both sides went into StandAlone, and I was then able to force one back to primary and reattach. All back to normal now.
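
For anyone finding this in the archives: the REJECT rules reset every packet on the replication port, so they presumably have to be removed again before the peers can reconnect. Below is a minimal sketch of one way to do that. It assumes the rules were inserted exactly as in Lars's loop above and that the port is 7789, as in this thread. The function name drbd_port_rules and the RUN variable are made up for illustration. By default it only prints the commands; set RUN=1 (as root) to actually run them.

```shell
#!/bin/sh
# Sketch only: print (or, with RUN=1, execute) the iptables commands that
# insert or delete the tcp-reset REJECT rules for a DRBD replication port.
# drbd_port_rules and RUN are hypothetical names, not part of DRBD or iptables.
drbd_port_rules() {
    action=$1   # -I to insert the rules, -D to delete them again
    port=$2     # replication port, 7789 in this thread
    for chain in INPUT OUTPUT; do
        for direction in dport sport; do
            # build the same rule specification that was used for the insert
            set -- iptables "$action" "$chain" -p tcp "--$direction" "$port" \
                -j REJECT --reject-with tcp-reset
            if [ "${RUN:-0}" = 1 ]; then
                "$@"          # really run it (needs root)
            else
                echo "$@"     # dry run: just show the command
            fi
        done
    done
}

# Example: show the four delete commands without running them.
# drbd_port_rules -D 7789
```

iptables -D with the same rule specification removes one matching rule per call, so if the insert loop was run more than once, the delete needs to be repeated just as often. Checking "iptables -L -n" afterwards shows whether anything for the port is left.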

Thanks!

Zev