[DRBD-user] "drbdadm verify" hung after 14%.

Fri Dec 12 11:30:44 CET 2008

On Thu, Dec 11, 2008 at 05:56:36PM -0800, Nolan wrote:
> Hello,
> 
> I've got two nodes running Ubuntu 8.10/64bit using the included DRBD:
> version: 8.2.6 (api:88/proto:86-88)
> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
> phil@fat-tyre, 2008-05-30 12:59:17
> 
> Each node has 4 drives, which are striped together using LVM, and then
> cut into 24 logical volumes.  One DRBD is attached to each of the 24
> lvs.  The two nodes speak over 2x bonded e1000s.  All was running well
> for 40+ days with 24 KVM VMs running.
> 
> I decided to try out the online verify functionality, and after adding
> "verify-alg crc32c;" to my config on both hosts, and running adjust, I
> ran:
> drbdadm verify VM24
> 
> All was well, and I watched "/proc/drbd" as the verify progressed.  But
> then it stopped at 14%:
> 24: cs:VerifyS st:Primary/Secondary ds:UpToDate/UpToDate C r---
>     ns:2035752 nr:160958366 dw:162994118 dr:23997736 al:186 bm:10021
> lo:0 pe:4624 ua:0 ap:16 oos:0
>          14%      5847476/39734074
> 
> The VM using that storage is hung hard.  Stracing it shows it blocked in
> a rather uninformative spot:
> root@node1:~# strace -p 9878
> Process 9878 attached - interrupt to quit
> futex(0xb531a0, 0x80 /* FUTEX_??? */, 2
> 
> Dmesg on the secondary node has nothing interesting, but dmesg on the
> primary node has:
> [3628294.472338] drbd24:   state = { cs:Connected st:Primary/Secondary
> ds:UpToDate/UpToDate r--- }
> [3628294.481196] drbd24:  wanted = { cs:VerifyS st:Primary/Secondary
> ds:UpToDate/UpToDate r--- }
> [3628524.571022] drbd24: conn( Connected -> VerifyS ) 
> [3628919.655921] drbd24: qemu-system-x86[10223] Concurrent local write
> detected! [DISCARD L] new: 952311s +3584; pending: 952311s +3584
> [3628919.668048] drbd24: qemu-system-x86[10223] Concurrent local write
> detected! [DISCARD L] new: 952318s +512; pending: 952318s +512
> [3628919.680433] drbd24: qemu-system-x86[10223] Concurrent local write
> detected! [DISCARD L] new: 799599s +3584; pending: 799599s +3584
> [3628919.692566] drbd24: qemu-system-x86[10223] Concurrent local write
> detected! [DISCARD L] new: 799606s +512; pending: 799606s +512
> [3629004.628073] drbd24: qemu-system-x86[10224] Concurrent local write
> detected! [DISCARD L] new: 952311s +3584; pending: 952311s +3584
> [3629004.640192] drbd24: qemu-system-x86[10224] Concurrent local write
> detected! [DISCARD L] new: 952318s +512; pending: 952318s +512
> [3629004.652675] drbd24: qemu-system-x86[10224] Concurrent local write
> detected! [DISCARD L] new: 799599s +3584; pending: 799599s +3584
> [3629004.664787] drbd24: qemu-system-x86[10224] Concurrent local write
> detected! [DISCARD L] new: 799606s +512; pending: 799606s +512
> 
> Any ideas what could be causing this?
> 
> Google on the "concurrent local write" error only turned up the check-in
> that added that code to DRBD.

That means something is submitting local writes to a block for which we
still have writes pending.  That is unexpected behaviour, and it would
lead to non-deterministic results on disk, so DRBD does not allow it.
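
To make "writes to the same block" concrete, here is a minimal sketch of
an overlap check on the numbers from the log lines above.  It is an
illustration only, not DRBD's actual code.  It assumes that "952311s" is
a start offset in 512-byte sectors and "+3584" is a length in bytes
(which fits the log: 3584 bytes is 7 sectors, and the next write starts
at 952318).  The function name ranges_overlap is made up for this
example.

```shell
#!/bin/sh
# Illustration only, not DRBD code.
# Assumption: start values are 512-byte sectors, sizes are bytes.
# ranges_overlap START1 BYTES1 START2 BYTES2
# Returns 0 (true) if the two write requests touch a common sector.
ranges_overlap() {
    # Exclusive end sector of each request, rounding the size up.
    end1=$(( $1 + ($2 + 511) / 512 ))
    end2=$(( $3 + ($4 + 511) / 512 ))
    # Two half-open ranges overlap if each starts before the other ends.
    [ "$1" -lt "$end2" ] && [ "$3" -lt "$end1" ]
}

# "new: 952311s +3584; pending: 952311s +3584" from the log
if ranges_overlap 952311 3584 952311 3584; then
    echo "overlap"
fi
```

A new request at 952311s +3584 against a pending request at 952318s +512
would not count as overlapping under this check, since the first one
ends exactly where the second begins.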

> I can leave the system as it is for a few days, if there is more
> information I should collect.

Great.

Please run
 echo 2 > /sys/module/drbd/parameters/proc_details
and provide the output of
 cat /proc/drbd
 ps -eo pid,state,wchan:30,cmd | grep -e drbd -e D
 cat /proc/mounts
 cat /proc/partitions
 dmsetup table
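
If it is more convenient, the commands above can be wrapped in a small
script that writes everything into one file to attach to the reply.
This is only a sketch: the function name collect_drbd_info and the
output file name are made up here, it needs to run as root on the
primary node, and it skips the proc_details step if that parameter file
is not writable.

```shell
#!/bin/sh
# Sketch: gather the requested diagnostics into a single file.
# collect_drbd_info OUTFILE
collect_drbd_info() {
    out=$1
    param=/sys/module/drbd/parameters/proc_details

    # Ask for more detail in /proc/drbd (needs root and a loaded module).
    if [ -w "$param" ]; then
        echo 2 > "$param"
    fi

    : > "$out"
    for cmd in \
        "cat /proc/drbd" \
        "ps -eo pid,state,wchan:30,cmd | grep -e drbd -e D" \
        "cat /proc/mounts" \
        "cat /proc/partitions" \
        "dmsetup table"
    do
        # Label each section with the command that produced it.
        printf '===== %s =====\n' "$cmd" >> "$out"
        sh -c "$cmd" >> "$out" 2>&1 || printf '(command failed)\n' >> "$out"
    done
}
```

Usage would be something like "collect_drbd_info drbd-info.txt", then
sending the resulting file to the list.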

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed