[DRBD-user] Recovering from erroneous sync state

Wed May 23 22:47:39 CEST 2012

On Wed, May 23, 2012 at 10:34 PM, Zev Weiss <zweiss at scout.wisc.edu> wrote:
>
> On May 23, 2012, at 3:22 PM, Florian Haas wrote:
>
>> On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss at scout.wisc.edu> wrote:
>>> Hi,
>>>
>>> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere).  Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
>>>
>>> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions?  It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.
>>
>> Can you post /proc/drbd contents from both nodes here?
>>
>
> Sure -- here's one node:
>
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss at mydomain, 2012-03-14 19:52:38
>
> <snip other resources>
>  9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
>    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>        [>...................] sync'ed:  5.9% (65536/65536)K
>        finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
>          0% sector pos: 0/10698352
>        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>        act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>
>
> And here's the other:
>
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss at fromage.scout.wisc.edu, 2012-03-14 19:52:38
>
> <snip other resources>
>  9: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r-----
>    ns:0 nr:0 dw:0 dr:664 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>        [>...................] sync'ed:  5.9% (65536/65536)K
>        finish: 18987:55:05 speed: 0 (0 -- 0) K/sec (stalled)
>          0% sector pos: 0/10698352
>        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>        act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0

Ugh. Can you force the device into the WFConnection state by injecting
a couple of iptables rules blocking the replication port, and then
"down" the resource?
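
Roughly along these lines. This is only a sketch, not something tested against your setup: the resource name "r9" and port 7797 are placeholders I made up, so substitute the values from your drbd.conf. It also assumes both nodes use the same replication port. By default it only prints the commands so you can review them first.

```shell
#!/bin/sh
# Sketch only: "r9" and 7797 are placeholders, not values from this thread.
# By default the commands are only printed; set RUN=1 (as root) to execute.

run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

wait_wfconnection() {
    # Dry run: just show the check. Real run: poll for up to ~30 seconds.
    if [ "${RUN:-0}" != "1" ]; then echo "drbdadm cstate $1"; return 0; fi
    i=0
    while [ "$i" -lt 30 ]; do
        [ "$(drbdadm cstate "$1")" = "WFConnection" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

isolate_and_down() {
    res="$1"; port="$2"
    # Drop traffic to the replication port in both directions.
    run iptables -I INPUT -p tcp --dport "$port" -j DROP
    run iptables -I OUTPUT -p tcp --dport "$port" -j DROP
    if wait_wfconnection "$res"; then
        run drbdadm down "$res"
    else
        echo "$res did not reach WFConnection; not running 'down'" >&2
    fi
    # Remove the temporary rules again.
    run iptables -D INPUT -p tcp --dport "$port" -j DROP
    run iptables -D OUTPUT -p tcp --dport "$port" -j DROP
}

isolate_and_down "${RES:-r9}" "${PORT:-7797}"
```

Run it with RES and PORT set for your resource to review the plan, then again with RUN=1 once it looks right. If "down" still hangs, the two DROP rules stay in place and need to be removed by hand.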

Also, Lars, can you shed a little more light on the bug and its
8.3.13 fix? I had thought the fix was commit 305dce2c, but that one
apparently fixes a problem introduced by c19050f4, which according to
git describe landed some thirty commits after 8.3.12 and so shouldn't
affect an 8.3.12 user.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now