[DRBD-user] WFBitMapS: secondary gone, data on the primary inaccessible

Thu Jan 15 17:26:16 CET 2009

On Thu, Jan 15, 2009 at 01:02:19PM +0100, Tomasz Chmielewski wrote:
> I was experiencing a weird user-space problem when DRBD nodes were
> connecting/disconnecting. So let's try to troubleshoot...

version?

> I was connecting/disconnecting the nodes (either by killing the VPN
> connection, or by using "drbdadm connect/disconnect" command).
>
> Until:
>
> # drbdadm disconnect some_thing
> State change failed: (-2) Refusing to be Primary without at least one UpToDate disk
> Command 'drbdsetup /dev/drbd9 disconnect' terminated with exit code 11
>
> Which is not entirely true, as this is the Primary and UpToDate!
>
>  9: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Inconsistent C r---
>     ns:0 nr:0 dw:144992692 dr:1680068992 al:603186 bm:601017 lo:0 pe:1 ua:0 ap:1
>         resync: used:0/31 hits:224202 misses:1770 starving:0 dirty:0 changed:1770
>         act_log: used:1/127 hits:35644987 misses:614310 starving:1481 dirty:10251 changed:603186
>
>
> What's more problematic, data on this primary is _inaccessible_, i.e., when we do:
>
> # fdisk -l /dev/drbd9
>
> We won't get any output, as fdisk (and a handful of other processes) will 
> be in uninterruptible sleep, waiting until DRBD changes state from 
> WFBitMapS:
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 21626 manager   20   0 11784  664  340 R   60  0.1   1:52.51 bash
>  174 root      15  -5     0    0    0 R   39  0.0 102:27.50 kswapd0
> 3161 root      20   0     0    0    0 D   20  0.0   2:23.97 drbd9_worker
> 3361 root      20   0 4177m 8828  732 R   20  0.9  10:21.97 tgtd
> 7956 root      20   0     0    0    0 D   20  0.0   3:25.06 pdflush
> 21203 root      20   0     0    0    0 R   20  0.0   2:07.11 drbd9_receiver
> 21617 root      20   0  3908  484  400 R   20  0.0   0:45.01 fdisk
>
>
> Is it by design that, when a node is a primary and a secondary connects to it, we lose
> access to data on the primary?

during bitmap exchange,
application io is postponed.
yes.
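
to see whether a device is currently in that window, the cs: field can be
read from /proc/drbd. a minimal sketch, assuming the 8.x /proc/drbd layout
shown in the quoted output; the function names are made up for this
example, and WFBitMapT is assumed to be the matching state on the peer:

```shell
# Print the connection state (the cs: field) of one minor,
# reading /proc/drbd-style text from stdin.
drbd_cstate_of() {
    awk -v m="$1:" '
        $1 == m {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cs:/) { sub(/^cs:/, "", $i); print $i; exit }
        }'
}

# Succeed while the given minor is still exchanging bitmaps,
# i.e. while application io is postponed.
drbd_in_bitmap_exchange() {
    case "$(drbd_cstate_of "$1")" in
        WFBitMapS|WFBitMapT) return 0 ;;
        *) return 1 ;;
    esac
}
```

usage would be something like
`drbd_in_bitmap_exchange 9 < /proc/drbd && echo "io on drbd9 is postponed"`.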

> Also, there seems to be a bug somewhere in the code responsible for WFBitMapS: it's in that
> state for more than 10 minutes now, although secondary is accessible. Is reboot the only
> option now to recover?

__VERSION__?

there have been races during bitmap exchange,
that may lead to deadlock (what you are seeing).

as a workaround, to get out of that deadlock,
you can cut the tcp connection (no, drbdadm will not
do it; use iptables to temporarily reject with tcp-reset...
make sure you cut _both_ the data and the meta data connection).
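
a rough sketch of what such temporary rules could look like. the helper
name and the port number are assumptions for illustration; use the port
from the "address" line of your resource. it only prints the commands, so
they can be reviewed before being run as root. matching the port as both
source and destination, in both INPUT and OUTPUT, is meant to catch the
data and the meta data connection regardless of which side initiated each.

```shell
# Print (do not run) iptables commands that answer DRBD's TCP traffic
# on one port with a tcp-reset.
#   $1: -I to insert the rules, -D to delete them again
#   $2: TCP port of the DRBD resource (assumption: taken from drbd.conf)
drbd_reset_rules() {
    op=$1
    port=$2
    for chain in INPUT OUTPUT; do
        for side in --sport --dport; do
            echo "iptables $op $chain -p tcp $side $port -j REJECT --reject-with tcp-reset"
        done
    done
}
```

for example, `drbd_reset_rules -I 7789 | sh` to cut the connections (7789
being a made-up port), and, once the connection state has left WFBitMapS,
`drbd_reset_rules -D 7789 | sh` to remove the rules again.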

to avoid the race when connecting, you can first suspend io on drbd
(drbdsetup <dev> suspend-io), then connect it, wait for the bitmap
exchange to be over (connection state either Connected already, or
SyncSource/SyncTarget), and resume io (drbdsetup <dev> resume-io).
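
the suspend / connect / wait / resume sequence could be scripted roughly
like this. a sketch only: the function names and the timeout are invented
for the example, and it assumes "drbdadm cstate <resource>" prints the
bare connection state on your version.

```shell
# Wait until the bitmap exchange is over, i.e. the connection state is
# Connected, SyncSource or SyncTarget. Polls once per second.
#   $1: resource name   $2: number of polls before giving up (default 60)
drbd_wait_bitmap_done() {
    res=$1
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        case "$(drbdadm cstate "$res")" in
            Connected|SyncSource|SyncTarget) return 0 ;;
        esac
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Suspend io, connect, wait for the bitmap exchange, then resume io.
#   $1: resource name   $2: device   $3: number of polls (optional)
drbd_connect_suspended() {
    res=$1
    dev=$2
    drbdsetup "$dev" suspend-io || return 1
    drbdadm connect "$res"
    drbd_wait_bitmap_done "$res" "${3:-60}"
    rc=$?
    # Resume io even after a timeout, so the device is not left frozen.
    drbdsetup "$dev" resume-io
    return "$rc"
}
```

e.g. `drbd_connect_suspended some_thing /dev/drbd9`. resuming io after a
timeout is a choice made for this sketch; if the exchange is still stuck at
that point, a non-zero return tells you to look at the connection first.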

as a fix, upgrade to either 8.0.14 or 8.3.0.

if you already use one of those versions,
let me know how to reproduce.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed