[DRBD-user] "drbdadm verify" hung after 14%.

Mon Dec 15 21:06:01 CET 2008

On Sat, Dec 13, 2008 at 12:57:59PM -0800, Nolan wrote:
> On Fri, 2008-12-13 at 17:11 +0100, Lars Ellenberg wrote:
> > apart from running many processes doing direct io to the same
> > block, there is not much I can think of that may produce these
> > concurrent writes.
>
> I am doing direct IO (via kvm's cache=off option).
>
> There is only the one process, but I believe it simulates AIO using
> glibc's thread-based AIO implementation.
>
> Since the guest (debian etch) is using SCSI TCQ, it could in theory
> write the same block many times.  No idea why it would do that though.
>
> I've also no idea why running a verify would trigger it, if that is
> something more than mere coincidence.
>
>either "coincidence", or because the added IO load (read the whole
>disk and checksum it) changes the timing.  this has nothing to do with
>the "hanging" drbd resource, though.
>
> > can you do the ps -eo | grep magic on the other node as well please?
>
> # ps -eo pid,state,wchan:30,cmd | grep -e drbd -e D
>   PID S WCHAN                          CMD
>  3002 S select                         /usr/local/bin/qemu-system-x86_64
> -m 512 -drive file=/dev/drbd23,if=scsi,cache=off,boot=on -drive
> if=ide,index=2,media=cdrom -usbdevice tablet -name root_vm_0 -net
> nic,macaddr=xx:xx:xx:xx:xx:xx,model=virtio -net tap -monitor
> unix:/tmp/VM_root_vm_0,server,nowait -tdf -daemonize -vnc :0,password

>  5261 S drbd_wait_peer_seq             [drbd24_receiver]
>          ^^^^^^^^^^^^^^^^^^
>
>there.
>that is an interesting hint.
>
>this has been fixed in 8.2.7.
>
>as a workaround, you can use e.g. iptables to disconnect/reconnect,
>but chances are that an online verify will again get stuck in your
>setup.
>
>just make sure that you add the drbd-8.2.7 hotfix for the online verify
>as well, so either use the drbd 8.2.7 tarball
>plus this patch:
>http://git.drbd.org/drbd-8.2.git/?p=drbd-8.2.git;a=commitdiff;h=1174410#patch1
>
>or even better, use drbd-8.2 HEAD, there:
>http://git.drbd.org/drbd-8.2.git/
>
>we are confident that we will release a fine 8.3.0 this week,
>which supersedes the 8.2 series.
>

Sorry to add a me-too reply to this thread, but we had the exact same
thing happen this weekend, except our system hung at 54% and we use
8.2.7.  I was able to bring the guest back by putting both nodes in
secondary and then the primary back to primary.
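
A minimal sketch of that demote/promote sequence, in case it helps
anyone else.  The resource name r0 and the DRY_RUN wrapper are
placeholders for illustration only, not taken from our configuration.
By default the script just prints the commands.  Note that
"drbdadm secondary" will refuse while something still holds the
device open.

```shell
#!/bin/sh
# Hypothetical resource name; substitute the one from your drbd.conf.
RES="${RES:-r0}"
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1: run on BOTH nodes to demote the resource to secondary.
demote() { run drbdadm secondary "$RES"; }

# Step 2: run only on the node that should serve the guest again.
promote() { run drbdadm primary "$RES"; }

demote
promote
```

Run it with DRY_RUN=0 as root only after checking the printed
commands against your own resource names.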

Are you confident this is fixed in 8.3?  I can provide any information
you may need, but our setup is the same except that we use the Xen
hypervisor, with guests on LVM/DRBD.  One message from the guest that
was hung:

Dec 15 06:30:56 xen01 kernel: drbd1: [drbd1_worker/14794] sock_sendmsg
time expired, ko = 4294967295
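
For what it is worth, the ko value in that message is exactly
2^32 - 1, which is what an unsigned 32-bit counter shows after being
decremented from zero.  The small check below only demonstrates that
arithmetic.  Reading it as "the ko counter started at 0 and wrapped"
is my assumption, not something the log itself confirms.

```shell
#!/bin/sh
# Value copied from the kernel log line above.
ko=4294967295

# Largest value an unsigned 32-bit integer can hold: 2^32 - 1.
max_u32=$(( (1 << 32) - 1 ))

# The same 32 bits read as a signed integer.
as_signed=$(( ko - (1 << 32) ))

echo "max_u32=$max_u32 as_signed=$as_signed"
```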

Thanks for your time.