Note: "permalinks" may not be as permanent as we would like;
direct links to old messages may well be a few messages off.
On Thu, Jan 15, 2009 at 01:02:19PM +0100, Tomasz Chmielewski wrote:
> I was experiencing a weird user-space problem when DRBD nodes were
> connecting/disconnecting.

So let's try to troubleshoot... version?

> I was connecting/disconnecting the nodes (either by killing the VPN
> connection, or by using the "drbdadm connect/disconnect" command).
>
> Until:
>
> # drbdadm disconnect some_thing
> State change failed: (-2) Refusing to be Primary without at least one UpToDate disk
> Command 'drbdsetup /dev/drbd9 disconnect' terminated with exit code 11
>
> Which is not entirely true, as this node is Primary and UpToDate!
>
>  9: cs:WFBitMapS st:Primary/Secondary ds:UpToDate/Inconsistent C r---
>     ns:0 nr:0 dw:144992692 dr:1680068992 al:603186 bm:601017 lo:0 pe:1 ua:0 ap:1
>         resync: used:0/31 hits:224202 misses:1770 starving:0 dirty:0 changed:1770
>         act_log: used:1/127 hits:35644987 misses:614310 starving:1481 dirty:10251 changed:603186
>
> What's more problematic, data on this primary is _inaccessible_, i.e., when we do:
>
> # fdisk -l /dev/drbd9
>
> we won't get any output, as fdisk (and a handful of other processes) will
> be in uninterruptible sleep, waiting until DRBD changes state from
> WFBitMapS:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 21626 manager   20   0 11784  664  340 R   60  0.1   1:52.51 bash
>   174 root      15  -5     0    0    0 R   39  0.0 102:27.50 kswapd0
>  3161 root      20   0     0    0    0 D   20  0.0   2:23.97 drbd9_worker
>  3361 root      20   0 4177m 8828  732 R   20  0.9  10:21.97 tgtd
>  7956 root      20   0     0    0    0 D   20  0.0   3:25.06 pdflush
> 21203 root      20   0     0    0    0 R   20  0.0   2:07.11 drbd9_receiver
> 21617 root      20   0  3908  484  400 R   20  0.0   0:45.01 fdisk
>
> Is it by design that when a node is Primary and the Secondary connects
> to it, we lose access to data on the primary?

Yes: during bitmap exchange, application IO is postponed.

> Also, there seems to be a bug somewhere in the code responsible for
> WFBitMapS: it has been in that state for more than 10 minutes now,
> although the secondary is accessible.
> Is reboot the only option now to recover?

__VERSION__?

There have been races during bitmap exchange that may lead to a deadlock
(which is what you are seeing).

As a workaround, to get out of that deadlock, you can cut the TCP
connection (no, drbdadm will not do it; use iptables to temporarily
reject with tcp-reset... make sure you cut _both_ the data and the
meta-data connection).

You can first suspend drbd (drbdsetup <dev> suspend-io), then connect
it, wait for the bitmap exchange to be over (connection state either
Connected already, or SyncSource/SyncTarget), and resume IO
(drbdsetup <dev> resume-io).

As a fix, upgrade to either 8.0.14 or 8.3.0. If you already use one of
those versions, let me know how to reproduce.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
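[Editor's sketch] The recovery sequence described in the reply above might be
scripted roughly as follows. This is not an official DRBD tool, just an
illustration under assumptions: the peer address and port are placeholders for
your own resource configuration, and since DRBD 8.x carries both the data and
the meta-data socket over the resource's configured port, traffic is rejected
in both directions to reset both sockets. The script only prints the commands
unless you explicitly set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch of the WFBitMapS deadlock workaround (suspend-io, cut TCP with
# tcp-reset, reconnect, resume-io). Placeholders below are assumptions:
# adjust DEV, PEER, PORT, and the resource name for your setup.
DRY_RUN=${DRY_RUN:-1}   # default: only print commands; set DRY_RUN=0 to execute

DEV=/dev/drbd9          # the stuck device (from the report above)
PEER=192.168.1.2        # placeholder: peer node address
PORT=7789               # placeholder: this resource's configured DRBD port

run() {
    # In dry-run mode, echo the command instead of executing it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# 1. Suspend application IO on the stuck device.
run drbdsetup "$DEV" suspend-io

# 2. Reject DRBD traffic in both directions with a TCP reset, so that
#    BOTH the data and the meta-data socket are cut.
run iptables -I INPUT  -p tcp -s "$PEER" --dport "$PORT" -j REJECT --reject-with tcp-reset
run iptables -I OUTPUT -p tcp -d "$PEER" --dport "$PORT" -j REJECT --reject-with tcp-reset

# ... once the connection has dropped, remove the temporary rules again ...
run iptables -D INPUT  -p tcp -s "$PEER" --dport "$PORT" -j REJECT --reject-with tcp-reset
run iptables -D OUTPUT -p tcp -d "$PEER" --dport "$PORT" -j REJECT --reject-with tcp-reset

# 3. Reconnect and wait for the bitmap exchange to finish: cs: in
#    /proc/drbd should become Connected, SyncSource, or SyncTarget.
run drbdadm connect some_thing
if [ "$DRY_RUN" = "0" ]; then
    until grep -Eq 'cs:(Connected|SyncSource|SyncTarget)' /proc/drbd; do
        sleep 1
    done
fi

# 4. Resume application IO.
run drbdsetup "$DEV" resume-io
```

Run it once with the default DRY_RUN=1 to review the exact commands for your
configuration before executing anything on a production node.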