[DRBD-user] "drbdadm verify" hung after 14%.

Lars Ellenberg lars.ellenberg at linbit.com
Tue Dec 16 14:13:45 CET 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Mon, Dec 15, 2008 at 03:06:01PM -0500, Coach-X wrote:
> On Sat, Dec 13, 2008 at 12:57:59PM -0800, Nolan wrote:
> > On Fri, 2008-12-13 at 17:11 +0100, Lars Ellenberg wrote:
> > > apart from running many processes doing direct io to the same
> > > block, there is not much I can think of that may produce these
> > > concurrent writes.
> >
> > I am doing direct IO (via kvm's cache=off option).
> >
> > There is only the one process, but I believe it simulates AIO using
> > glibc's thread-based AIO implementation.
> >
> > Since the guest (debian etch) is using SCSI TCQ, it could in theory
> > write the same block many times.  No idea why it would do that though.
> >
> > I've also no idea why running a verify would trigger it, if that is
> > something more than mere coincidence.
> >
> >either "coincidence", or because the added IO load (reading the whole
> >disk and checksumming it) changes the timing.  this has nothing to do with
> >the "hanging" drbd resource, though.
> >
> > > can you do the ps -eo | grep magic on the other node as well please?
> >
> > # ps -eo pid,state,wchan:30,cmd | grep -e drbd -e D
> >   PID S WCHAN                          CMD
> >  3002 S select                         /usr/local/bin/qemu-system-x86_64
> > -m 512 -drive file=/dev/drbd23,if=scsi,cache=off,boot=on -drive
> > if=ide,index=2,media=cdrom -usbdevice tablet -name root_vm_0 -net
> > nic,macaddr=xx:xx:xx:xx:xx:xx,model=virtio -net tap -monitor
> > unix:/tmp/VM_root_vm_0,server,nowait -tdf -daemonize -vnc :0,password
> 
> >  5261 S drbd_wait_peer_seq             [drbd24_receiver]
> >          ^^^^^^^^^^^^^^^^^^
> >
> >there.
> >that is an interesting hint.
> >
> >this has been fixed in 8.2.7.
> >
> >as a workaround, you can use e.g. iptables to disconnect/reconnect,
> >but chances are that an online verify will again get stuck in your setup.
> >
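(for reference, that iptables workaround could look something like the
sketch below; the port is just an assumption, use whatever the resource's
address line in your drbd.conf says:

   # block drbd traffic for this resource so the stuck connection times out
   iptables -A INPUT -p tcp --dport 7789 -j DROP
   # wait until "cat /proc/drbd" shows the resource disconnected / WFConnection,
   # then remove the rule again so the peers can reconnect
   iptables -D INPUT -p tcp --dport 7789 -j DROP

adjust port and chain to match your setup.)
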
> >just make sure that you add the drbd-8.2.7 hotfix for the online verify
> >as well, so either use the drbd 8.2.7 tarball
> >plus this patch:
> >http://git.drbd.org/drbd-8.2.git/?p=drbd-8.2.git;a=commitdiff;h=1174410#patch1
> >
> >or even better, use drbd-8.2 HEAD, there:
> >http://git.drbd.org/drbd-8.2.git/
> >
> >we are confident that we will release a fine 8.3.0 this week,
> >which supersedes the 8.2 series.
> >
> 
> Sorry to add a me too reply to this thread, but we had the exact same
> thing happen this weekend, except our system hung at 54% and we use
> 8.2.7.

hm.

now, you may have seen similar symptoms.
but when you say "exact same thing",
would you please tell me what
exactly that same thing is,
that is, _what_ symptoms you see,
so I can figure out whether it maybe has a similar _cause_ as well?

> I was able to bring the guest back, by putting both nodes in
> secondary and then the primary back to primary.
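
(i.e., something along the lines of

   drbdadm secondary <resource>    # on both nodes
   drbdadm primary   <resource>    # then again on the node that should be primary

if I read that right?)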

then it is definitely not the same thing.
and I wonder why that would have helped.

> Are you confident this is fixed in 8.3?

as I don't know what "it" is,
I cannot say.

> I can provide any information you may need, but our setup is the same
> except for xen hypervisor, guests are on lvm/drbd.  One message from
> the guest that was hung:
> 
> Dec 15 06:30:56 xen01 kernel: drbd1: [drbd1_worker/14794] sock_sendmsg
> time expired, ko = 4294967295

and this is something completely different.

these messages appear when a Primary is not able to get _data_ through
to the other node, while that node still responds in time to non-data drbd packets.

which means that your secondary was apparently so busy that its IO
subsystem did not serve data requests in time, or the drbd receiver
thread on the secondary got stuck somewhere.

and if it does not count down (4294967295, then ...94, ...93, ...92, and so on),
but only "occasionally" says "ko = 4294967295",
then it was not stuck at all, but still making (very slow) progress.

see also the ko-count config option.
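
(for example, in the net section of the resource; the numbers are only an
illustration, tune them to your environment:

   net {
     timeout   60;   # unit is 0.1 seconds, i.e. 6 seconds
     ko-count  4;    # drop the connection if the peer fails to complete
                     # a data request within timeout * ko-count
   }
)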

so maybe there is no bug in drbd at all, in your situation, but "just"
an overloaded secondary, or maybe a network stack under memory pressure.

not "the same thing" as an internal deadlock on some sequence counter
due to a missing wake_up call because of drbd packets on the meta socket
overtaking drbd packets on the data socket and forgetting to update that
sequence counter. not at all.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


