[DRBD-user] "drbdadm verify" hung after 14%.

Lars Ellenberg lars.ellenberg at linbit.com
Tue Dec 16 16:33:42 CET 2008


On Tue, Dec 16, 2008 at 09:25:36AM -0500, Coach-X wrote:
> >> Sorry to add a me too reply to this thread, but we had the exact same
> >> thing happen this weekend, except our system hung at 54% and we use
> >> 8.2.7.
> > 
> > hm.
> > 
> > now, you may have seen similar symptoms.
> > but when you say "exact same thing",
> > would you please tell me what
> > exactly that same thing is,
> > that is _what_ symptoms you see,
> > so I can figure if it maybe has a similar _cause_ as well?
> 
> Similar symptoms.  Auto verify initiated, one guest "hung", and we were
> unable to access it.  The auto verify was "stuck" at 54% after 11 1/2 hours.
> 
> Here is the log for drbd1:
> 
> Dec 13 00:05:48 xen01 kernel: drbd1: conn( Connected -> VerifyS )
> Dec 13 11:28:52 xen01 kernel: drbd1: peer( Secondary -> Unknown ) conn(
> VerifyS -> TearDown ) pdsk( UpToDate -> DUnknown )
> Dec 13 11:28:52 xen01 kernel: drbd1: Creating new current UUID
> Dec 13 11:28:52 xen01 kernel: drbd1: meta connection shut down by peer.
> Dec 13 11:28:52 xen01 kernel: drbd1: asender terminated
> Dec 13 11:28:52 xen01 kernel: drbd1: Terminating asender thread
> Dec 13 11:28:52 xen01 kernel: drbd1: Connection closed
> Dec 13 11:28:52 xen01 kernel: drbd1: conn( TearDown -> Unconnected )
> Dec 13 11:28:52 xen01 kernel: drbd1: receiver terminated
> Dec 13 11:28:52 xen01 kernel: drbd1: Restarting receiver thread
> Dec 13 11:28:52 xen01 kernel: drbd1: receiver (re)started
> Dec 13 11:28:52 xen01 kernel: drbd1: conn( Unconnected -> WFConnection )
> Dec 13 11:47:57 xen01 kernel: drbd1: Handshake successful: Agreed
> network protocol version 88
> Dec 13 11:47:57 xen01 kernel: drbd1: conn( WFConnection -> WFReportParams )
> Dec 13 11:47:57 xen01 kernel: drbd1: Starting asender thread (from
> drbd1_receiver [14822])
> Dec 13 11:47:57 xen01 kernel: drbd1: data-integrity-alg: <not-used>
> Dec 13 11:47:57 xen01 kernel: drbd1: drbd_sync_handshake:
> Dec 13 11:47:57 xen01 kernel: drbd1: self
> 7D59777BC5085D57:D1844DEFB56C9A8B:8F9134EABA18E3E0:19D0B2E43F10CCE5
> Dec 13 11:47:57 xen01 kernel: drbd1: peer
> D1844DEFB56C9A8A:0000000000000000:8F9134EABA18E3E0:19D0B2E43F10CCE5
> Dec 13 11:47:57 xen01 kernel: drbd1: uuid_compare()=1 by rule 7
> Dec 13 11:47:57 xen01 kernel: drbd1: peer( Unknown -> Secondary ) conn(
> WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> Dec 13 11:47:57 xen01 kernel: drbd1: conn( WFBitMapS -> SyncSource )
> pdsk( UpToDate -> Inconsistent )
> Dec 13 11:47:57 xen01 kernel: drbd1: Began resync as SyncSource (will
> sync 82348 KB [20587 bits set]).
> Dec 13 11:48:03 xen01 kernel: drbd1: Resync done (total 6 sec; paused 0
> sec; 13724 K/sec)
> Dec 13 11:48:03 xen01 kernel: drbd1: conn( SyncSource -> Connected )
> pdsk( Inconsistent -> UpToDate )
> Dec 15 06:30:56 xen01 kernel: drbd1: [drbd1_worker/14794] sock_sendmsg
> time expired, ko = 4294967295
> 
> All others finished fine:
> 
> Dec 13 00:05:48 xen01 kernel: drbd2: conn( Connected -> VerifyS )
> Dec 13 01:35:08 xen01 kernel: drbd2: Online verify  done (total 5359
> sec; paused 0 sec; 9780 K/sec)
> Dec 13 01:35:08 xen01 kernel: drbd2: conn( VerifyS -> Connected )
> 
> Dec 13 00:05:48 xen01 kernel: drbd3: conn( Connected -> VerifyS )
> Dec 13 00:25:18 xen01 kernel: drbd3: Online verify  done (total 1170
> sec; paused 0 sec; 8960 K/sec)
> Dec 13 00:25:18 xen01 kernel: drbd3: conn( VerifyS -> Connected )
> 
> Dec 13 00:05:48 xen01 kernel: drbd4: conn( Connected -> VerifyS )
> Dec 13 00:41:45 xen01 kernel: drbd4: Online verify  done (total 2157
> sec; paused 0 sec; 9720 K/sec)
> Dec 13 00:41:45 xen01 kernel: drbd4: conn( VerifyS -> Connected )
> 
> My intervention started at 11:27.
> 
> >> I was able to bring the guest back, by putting both nodes in
> >> secondary and then the primary back to primary.
> > 
> > then it is definitely not the same thing.
> > and I wonder why that would have helped.
> 
> I failed to mention that I first tried to shut the guest down; once both
> systems were in secondary mode it finally shut down.  Set back to
> primary to bring the guest back up.

now, how can you make drbd secondary, if the guest is still using it?
that should not work.

there is something "interesting" in your setup.

> >> Are you confident this is fixed in 8.3?
> > 
> > as I don't know what "it" is,
> > I cannot say.
> > 
> >> I can provide any information you may need, but our setup is the same
> >> except for xen hypervisor, guests are on lvm/drbd.  One message from
> >> the guest that was hung:
> >>
> >> Dec 15 06:30:56 xen01 kernel: drbd1: [drbd1_worker/14794] sock_sendmsg
> >> time expired, ko = 4294967295
> > 
> > and this is something completely different.
> 
> > these messages appear when a Primary is not able to get _data_ through
> > to the other node, but it still responds timely on non-data drbd packets.
> > 
> > which means that your secondary was apparently so busy that its IO
> > subsystem did not serve data requests in time, or the drbd receiver
> > thread on the secondary got stuck somewhere.
> > 
> > and if it does not count down (4294967295, 94, 93, 92, etc.),
> > but only "occasionally" says "ko = 4294967295",
> > then it was not stuck at all, but still (very) slowly making progress.
> > 
> > see also the ko-count config option.
> > 
> > so maybe there is no bug in drbd at all, in your situation, but "just"
> > an overloaded secondary, or maybe a network stack on memory pressure.
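to tell the two cases apart from the logs, something like the following rough sketch might help. it pulls the "ko = N" values out of syslog lines and checks whether they count down. the function names ko_values and is_counting_down are made up for illustration, and the sketch assumes each log message is on a single, unwrapped line (the lines quoted in this mail were wrapped by the mailer). note that 4294967295 is 2**32 - 1, the value an unsigned 32-bit counter shows after wrapping below zero. reading "94, 93, 92" above as the trailing digits of 4294967294, 4294967293, ... is an assumption.

```python
import re
from collections import defaultdict

# Matches e.g. "drbd1: [drbd1_worker/14794] sock_sendmsg time expired, ko = 4294967295"
KO_LINE = re.compile(
    r"(drbd\d+): \[[^\]]+\] sock_sendmsg time expired, ko = (\d+)"
)


def ko_values(log_lines):
    """Collect the ko values per DRBD device, in the order they were logged."""
    values = defaultdict(list)
    for line in log_lines:
        match = KO_LINE.search(line)
        if match:
            values[match.group(1)].append(int(match.group(2)))
    return dict(values)


def is_counting_down(values):
    """True if there are at least two values and each is smaller than the last.

    A strictly decreasing run suggests the send is really stuck; the same
    value repeated "occasionally" suggests slow progress instead.
    """
    return len(values) >= 2 and all(b < a for a, b in zip(values, values[1:]))
```

feed it the lines of your kernel log (for example the result of open(path).read().splitlines(), with whatever path your syslog uses) and check is_counting_down() on the list for each device.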
> 
> Our secondary is a heartbeat failover system.

now, what is that supposed to mean?

everything is possible...
I cannot say.

you have to reproduce,
and find out what is going on,
using whatever you have available,
netstat, tcpdump, kernel stack traces via sysrq,
find out which drbd threads are doing what,
and so on.
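
as a starting point for the "which drbd threads are doing what" part, here is a minimal sketch that lists the drbd kernel threads and their scheduler state by reading /proc/<pid>/stat on Linux. the function name drbd_threads and the proc_root parameter are hypothetical, chosen for illustration. it only shows the state letter (for example "D" for uninterruptible sleep); it does not replace the stack traces you get via sysrq.

```python
import os


def drbd_threads(proc_root="/proc"):
    """Return sorted (pid, name, state) tuples for tasks named drbd*."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
            # Format is "pid (comm) state ..."; comm may contain spaces,
            # so locate it between the first "(" and the last ")".
            name = stat[stat.index("(") + 1:stat.rindex(")")]
            state = stat[stat.rindex(")") + 1:].split()[0]
        except (OSError, ValueError, IndexError):
            continue  # task exited while scanning, or unexpected format
        if name.startswith("drbd"):
            found.append((int(entry), name, state))
    return sorted(found)
```

calling drbd_threads() on the affected node and printing the tuples a few times in a row should show whether a worker, receiver or asender thread stays in the same state while the verify appears stuck.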

or hire someone to do that for you.

or live with it.

sorry that I cannot help further,
but there is not too much useful information to go on.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


