Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
>> Sorry to add a me too reply to this thread, but we had the exact same
>> thing happen this weekend, except our system hung at 54% and we use
>> 8.2.7.
>
> hm.
>
> now, you may have seen similar symptoms.
> but when you say "exact same thing",
> would you please tell me what
> exactly that same thing is,
> that is _what_ symptoms you see,
> so I can figure if it maybe has a similar _cause_ as well?

Similar symptoms. Auto verify initiated, one guest "hung", unable to
access the guest. The auto verify was "stuck" at 54% after 11 1/2 hours.

Here is the log for drbd1:

Dec 13 00:05:48 xen01 kernel: drbd1: conn( Connected -> VerifyS )
Dec 13 11:28:52 xen01 kernel: drbd1: peer( Secondary -> Unknown ) conn( VerifyS -> TearDown ) pdsk( UpToDate -> DUnknown )
Dec 13 11:28:52 xen01 kernel: drbd1: Creating new current UUID
Dec 13 11:28:52 xen01 kernel: drbd1: meta connection shut down by peer.
Dec 13 11:28:52 xen01 kernel: drbd1: asender terminated
Dec 13 11:28:52 xen01 kernel: drbd1: Terminating asender thread
Dec 13 11:28:52 xen01 kernel: drbd1: Connection closed
Dec 13 11:28:52 xen01 kernel: drbd1: conn( TearDown -> Unconnected )
Dec 13 11:28:52 xen01 kernel: drbd1: receiver terminated
Dec 13 11:28:52 xen01 kernel: drbd1: Restarting receiver thread
Dec 13 11:28:52 xen01 kernel: drbd1: receiver (re)started
Dec 13 11:28:52 xen01 kernel: drbd1: conn( Unconnected -> WFConnection )
Dec 13 11:47:57 xen01 kernel: drbd1: Handshake successful: Agreed network protocol version 88
Dec 13 11:47:57 xen01 kernel: drbd1: conn( WFConnection -> WFReportParams )
Dec 13 11:47:57 xen01 kernel: drbd1: Starting asender thread (from drbd1_receiver [14822])
Dec 13 11:47:57 xen01 kernel: drbd1: data-integrity-alg: <not-used>
Dec 13 11:47:57 xen01 kernel: drbd1: drbd_sync_handshake:
Dec 13 11:47:57 xen01 kernel: drbd1: self 7D59777BC5085D57:D1844DEFB56C9A8B:8F9134EABA18E3E0:19D0B2E43F10CCE5
Dec 13 11:47:57 xen01 kernel: drbd1: peer D1844DEFB56C9A8A:0000000000000000:8F9134EABA18E3E0:19D0B2E43F10CCE5
Dec 13 11:47:57 xen01 kernel: drbd1: uuid_compare()=1 by rule 7
Dec 13 11:47:57 xen01 kernel: drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Dec 13 11:47:57 xen01 kernel: drbd1: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Dec 13 11:47:57 xen01 kernel: drbd1: Began resync as SyncSource (will sync 82348 KB [20587 bits set]).
Dec 13 11:48:03 xen01 kernel: drbd1: Resync done (total 6 sec; paused 0 sec; 13724 K/sec)
Dec 13 11:48:03 xen01 kernel: drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Dec 15 06:30:56 xen01 kernel: drbd1: [drbd1_worker/14794] sock_sendmsg time expired, ko = 4294967295

All the others finished fine:

Dec 13 00:05:48 xen01 kernel: drbd2: conn( Connected -> VerifyS )
Dec 13 01:35:08 xen01 kernel: drbd2: Online verify done (total 5359 sec; paused 0 sec; 9780 K/sec)
Dec 13 01:35:08 xen01 kernel: drbd2: conn( VerifyS -> Connected )
Dec 13 00:05:48 xen01 kernel: drbd3: conn( Connected -> VerifyS )
Dec 13 00:25:18 xen01 kernel: drbd3: Online verify done (total 1170 sec; paused 0 sec; 8960 K/sec)
Dec 13 00:25:18 xen01 kernel: drbd3: conn( VerifyS -> Connected )
Dec 13 00:05:48 xen01 kernel: drbd4: conn( Connected -> VerifyS )
Dec 13 00:41:45 xen01 kernel: drbd4: Online verify done (total 2157 sec; paused 0 sec; 9720 K/sec)
Dec 13 00:41:45 xen01 kernel: drbd4: conn( VerifyS -> Connected )

My intervention started at 11:27.
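(For anyone who hits the same thing: the connection state and verify
progress can be read from /proc/drbd, and a verify that has stopped
making progress can usually be aborted by bouncing the connection.
A minimal sketch; the resource name "r1" below is only a placeholder,
use whatever name the device is configured under:

    # show connection state and verify/resync progress for all resources
    cat /proc/drbd

    # connection state of one resource
    drbdadm cstate r1

    # abort a stuck online verify by bouncing the connection
    drbdadm disconnect r1
    drbdadm connect r1

Disconnecting/reconnecting ends the verify run; it does not touch the
data itself.)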
>> I was able to bring the guest back, by putting both nodes in
>> secondary and then the primary back to primary.
>
> then it is definitely not the same thing.
> and I wonder why that would have helped.

I failed to mention that I first tried to shut the guest down; once both
nodes were in secondary mode it finally shut down. I then set the node
back to primary to bring the guest back up.

>> Are you confident this is fixed in 8.3?
>
> as I don't know what "it" is,
> I cannot say.
>
>> I can provide any information you may need, but our setup is the same
>> except for xen hypervisor, guests are on lvm/drbd. One message from
>> the guest that was hung:
>>
>> Dec 15 06:30:56 xen01 kernel: drbd1: [drbd1_worker/14794] sock_sendmsg
>> time expired, ko = 4294967295
>
> and this is something completely different.
> these messages appear when a Primary is not able to get _data_ through
> to the other node, but it still responds timely on non-data drbd packets.
>
> which means that your secondary was apparently so busy that its IO
> subsystem did not serve data requests in time, or the drbd receiver
> thread on the secondary got stuck somewhere.
>
> and if it does not count down (4294967295, 94, 93, 92, etc.),
> but "occasionally" saying "ko = 4294967295",
> then it was not stuck at all, but (very) slowly making progress, still.
>
> see also the ko-count config option.
>
> so maybe there is no bug in drbd at all, in your situation, but "just"
> an overloaded secondary, or maybe a network stack on memory pressure.

Our secondary is a heartbeat failover system.

Linux 2.6.22.19-0.1-xen (geeko@buildhost) (gcc 4.2.1) #1 SMP 2008-10-14 22:17:43 +0200 8CPU [xen02.]

Memory:      Total        Used        Free      Shared     Buffers
Mem:        995328      984856       10472           0      304816
Swap:       995944         264      995680

Bootup: Wed Dec 3 16:20:26 2008    Load average: 0.25 0.17 0.10 1/208 17144

user  :   0:09:32.19   0.0%  page in :  17489402  disk 1:  2895779r 13736405w
nice  :   0:00:06.24   0.0%  page out:  61642457  disk 2:  3426590r 13691503w
system:   2:26:22.54   0.1%  page act:    952661  disk 3:   739460r   216476w
IOwait:   1:01:23.80   0.0%  page dea:    659094
hw irq:   0:00:00.00   0.0%  page flt:  14469038
sw irq:   0:01:48.80   0.0%  swap in :         0
idle  : 101d 12:22:36.01  99.8%  swap out:        70
uptime:  12d 17:01:16.12         context :  407227569

Perhaps it is memory related?

> not "the same thing" as an internal deadlock on some sequence counter
> due to a missing wake_up call because of drbd packets on the meta socket
> overtaking drbd packets on the data socket and forgetting to update that
> sequence counter.

not at all.
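(On the ko-count option mentioned above: it lives in the net section of
drbd.conf. A minimal fragment for illustration; the resource name and the
values are only examples, not a recommendation for this setup:

    resource r1 {
      net {
        timeout   60;   # unit is 0.1 seconds, i.e. 6 seconds
        ko-count  4;    # if the peer fails to complete a single write
                        # request for 4 * timeout, it is considered dead
                        # and the Primary drops the connection instead of
                        # blocking indefinitely; 0 disables the count-down
      }
    }

With ko-count left at 0, the Primary keeps waiting on a slow or stuck
Secondary, which matches the non-decrementing "ko = 4294967295" messages
seen in the log above.)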