[DRBD-user] WFBitMapS: secondary gone, data on the primary inaccessible
lars.ellenberg at linbit.com
Thu Jan 15 17:26:16 CET 2009
On Thu, Jan 15, 2009 at 01:02:19PM +0100, Tomasz Chmielewski wrote:
> I was experiencing a weird user-space problem when DRBD nodes were
> connecting/disconnecting. So let's try to troubleshoot...
> I was connecting/disconnecting the nodes (either by killing the VPN
> connection, or by using "drbdadm connect/disconnect" command).
> # drbdadm disconnect some_thing
> State change failed: (-2) Refusing to be Primary without at least one UpToDate disk
> Command 'drbdsetup /dev/drbd9 disconnect' terminated with exit code 11
> Which is not entirely true, as this is the Primary and UpToDate!
> 9: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Inconsistent C r---
> ns:0 nr:0 dw:144992692 dr:1680068992 al:603186
> bm:601017 lo:0 pe:1 ua:0 ap:1 resync: used:0/31
> hits:224202 misses:1770 starving:0 dirty:0 changed:1770
> act_log: used:1/127 hits:35644987 misses:614310 starving:1481 dirty:10251
> What's more problematic, data on this primary is _inaccessible_, i.e., when we do:
> # fdisk -l /dev/drbd9
> We won't get any output, as fdisk (and a handful of other processes) will
> be in uninterruptible sleep, waiting until DRBD changes state from WFBitMapS:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 21626 manager 20 0 11784 664 340 R 60 0.1 1:52.51 bash
> 174 root 15 -5 0 0 0 R 39 0.0 102:27.50 kswapd0
> 3161 root 20 0 0 0 0 D 20 0.0 2:23.97 drbd9_worker
> 3361 root 20 0 4177m 8828 732 R 20 0.9 10:21.97 tgtd
> 7956 root 20 0 0 0 0 D 20 0.0 3:25.06 pdflush
> 21203 root 20 0 0 0 0 R 20 0.0 2:07.11 drbd9_receiver
> 21617 root 20 0 3908 484 400 R 20 0.0 0:45.01 fdisk
> Is it by design that when a node is primary and a secondary connects to it, we lose
> access to data on the primary?
during bitmap exchange,
application io is postponed.
> Also, there seems to be a bug somewhere in the code responsible for WFBitMapS: it's been in that
> state for more than 10 minutes now, although the secondary is accessible. Is reboot the only
> option now to recover?
there have been races during bitmap exchange
that may lead to deadlock (which is what you are seeing).
as a workaround, to get out of that deadlock,
you can cut the tcp connection (no, drbdadm will not
do it; use iptables to temporarily reject with tcp-reset...
make sure you cut _both_ the data and meta data connections).
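a sketch of cutting the connection with iptables, assuming both the data
and meta data sockets use the resource's configured port (7789 below is
hypothetical; take the real port from the "address" lines in your
drbd.conf):

```shell
# hypothetical port -- replace with the port from the resource's
# "address ...:PORT;" lines in /etc/drbd.conf
DRBD_PORT=7789

# reject the peer's packets with a tcp reset, so both DRBD tcp
# connections (data and meta data) are torn down immediately
iptables -I INPUT -p tcp --dport "$DRBD_PORT" -j REJECT --reject-with tcp-reset

# ... wait for the connection state to leave WFBitMapS
# (watch /proc/drbd), then remove the rule again
iptables -D INPUT -p tcp --dport "$DRBD_PORT" -j REJECT --reject-with tcp-reset
```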
you can first suspend drbd io (drbdsetup <dev> suspend-io), then connect
it, wait for the bitmap exchange to be over (connection state either
Connected already, or SyncSource/SyncTarget), and resume io (drbdsetup <dev> resume-io).
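the suspend/reconnect sequence might look like this (a sketch; the
device and resource name "some_thing" are taken from the report above,
and the grep simply watches /proc/drbd for any of the post-exchange
connection states):

```shell
DEV=/dev/drbd9          # the stuck device from the report above

# keep new application io from piling up while the link is re-established
drbdsetup "$DEV" suspend-io

# reconnect and wait until the bitmap exchange is over;
# note this grep matches any device in /proc/drbd, so on a
# multi-resource node you would filter for the "9:" line first
drbdadm connect some_thing
until grep -E 'cs:(Connected|SyncSource|SyncTarget)' /proc/drbd >/dev/null; do
    sleep 1
done

# let the queued io continue
drbdsetup "$DEV" resume-io
```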
as a fix, upgrade to either 8.0.14 or 8.3.0.
if you already use one of those versions,
let me know how to reproduce.
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
please don't Cc me, but send to list -- I'm subscribed