Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
>> Sorry to add a me too reply to this thread, but we had the exact same
>> thing happen this weekend, except our system hung at 54% and we use
>> 8.2.7.
>
> hm.
>
> now, you may have seen similar symptoms.
> but when you say "exact same thing",
> would you please tell me what
> exactly that same thing is,
> that is _what_ symptoms you see,
> so I can figure if it maybe has a similar _cause_ as well?

Similar symptoms. Auto verify initiated, one guest "hung", unable to
access the guest. The auto verify was "stuck" at 54% after 11 1/2 hours.

Here is the log for drbd1:

Dec 13 00:05:48 xen01 kernel: drbd1: conn( Connected -> VerifyS )
Dec 13 11:28:52 xen01 kernel: drbd1: peer( Secondary -> Unknown ) conn( VerifyS -> TearDown ) pdsk( UpToDate -> DUnknown )
Dec 13 11:28:52 xen01 kernel: drbd1: Creating new current UUID
Dec 13 11:28:52 xen01 kernel: drbd1: meta connection shut down by peer.
Dec 13 11:28:52 xen01 kernel: drbd1: asender terminated
Dec 13 11:28:52 xen01 kernel: drbd1: Terminating asender thread
Dec 13 11:28:52 xen01 kernel: drbd1: Connection closed
Dec 13 11:28:52 xen01 kernel: drbd1: conn( TearDown -> Unconnected )
Dec 13 11:28:52 xen01 kernel: drbd1: receiver terminated
Dec 13 11:28:52 xen01 kernel: drbd1: Restarting receiver thread
Dec 13 11:28:52 xen01 kernel: drbd1: receiver (re)started
Dec 13 11:28:52 xen01 kernel: drbd1: conn( Unconnected -> WFConnection )
Dec 13 11:47:57 xen01 kernel: drbd1: Handshake successful: Agreed network protocol version 88
Dec 13 11:47:57 xen01 kernel: drbd1: conn( WFConnection -> WFReportParams )
Dec 13 11:47:57 xen01 kernel: drbd1: Starting asender thread (from drbd1_receiver [14822])
Dec 13 11:47:57 xen01 kernel: drbd1: data-integrity-alg: <not-used>
Dec 13 11:47:57 xen01 kernel: drbd1: drbd_sync_handshake:
Dec 13 11:47:57 xen01 kernel: drbd1: self 7D59777BC5085D57:D1844DEFB56C9A8B:8F9134EABA18E3E0:19D0B2E43F10CCE5
Dec 13 11:47:57 xen01 kernel: drbd1: peer D1844DEFB56C9A8A:0000000000000000:8F9134EABA18E3E0:19D0B2E43F10CCE5
Dec 13 11:47:57 xen01 kernel: drbd1: uuid_compare()=1 by rule 7
Dec 13 11:47:57 xen01 kernel: drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Dec 13 11:47:57 xen01 kernel: drbd1: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Dec 13 11:47:57 xen01 kernel: drbd1: Began resync as SyncSource (will sync 82348 KB [20587 bits set]).
Dec 13 11:48:03 xen01 kernel: drbd1: Resync done (total 6 sec; paused 0 sec; 13724 K/sec)
Dec 13 11:48:03 xen01 kernel: drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Dec 15 06:30:56 xen01 kernel: drbd1: [drbd1_worker/14794] sock_sendmsg time expired, ko = 4294967295

All the others finished fine:

Dec 13 00:05:48 xen01 kernel: drbd2: conn( Connected -> VerifyS )
Dec 13 01:35:08 xen01 kernel: drbd2: Online verify done (total 5359 sec; paused 0 sec; 9780 K/sec)
Dec 13 01:35:08 xen01 kernel: drbd2: conn( VerifyS -> Connected )
Dec 13 00:05:48 xen01 kernel: drbd3: conn( Connected -> VerifyS )
Dec 13 00:25:18 xen01 kernel: drbd3: Online verify done (total 1170 sec; paused 0 sec; 8960 K/sec)
Dec 13 00:25:18 xen01 kernel: drbd3: conn( VerifyS -> Connected )
Dec 13 00:05:48 xen01 kernel: drbd4: conn( Connected -> VerifyS )
Dec 13 00:41:45 xen01 kernel: drbd4: Online verify done (total 2157 sec; paused 0 sec; 9720 K/sec)
Dec 13 00:41:45 xen01 kernel: drbd4: conn( VerifyS -> Connected )

My intervention started at 11:27.
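(For anyone who hits the same thing: the connection state and verify
progress can be read from /proc/drbd, and a verify that has stopped
making progress can usually be aborted by bouncing the connection.
A minimal sketch; the resource name "r1" below is only a placeholder,
use whatever name the device is configured under:

    # show connection state and verify/resync progress for all resources
    cat /proc/drbd

    # connection state of one resource
    drbdadm cstate r1

    # abort a stuck online verify by bouncing the connection
    drbdadm disconnect r1
    drbdadm connect r1

Disconnecting/reconnecting ends the verify run; it does not touch the
data itself.)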
>> I was able to bring the guest back, by putting both nodes in
>> secondary and then the primary back to primary.
>
> then it is definitely not the same thing.
> and I wonder why that would have helped.

I failed to mention that I first tried to shut the guest down; once both
nodes were in secondary mode it finally shut down. I then set the node
back to primary to bring the guest back up.

>> Are you confident this is fixed in 8.3?
>
> as I don't know what "it" is,
> I cannot say.
>
>> I can provide any information you may need, but our setup is the same
>> except for xen hypervisor, guests are on lvm/drbd. One message from
>> the guest that was hung:
>>
>> Dec 15 06:30:56 xen01 kernel: drbd1: [drbd1_worker/14794] sock_sendmsg
>> time expired, ko = 4294967295
>
> and this is something completely different.
> these messages appear when a Primary is not able to get _data_ through
> to the other node, but it still responds timely on non-data drbd packets.
>
> which means that your secondary was apparently so busy that its IO
> subsystem did not serve data requests in time, or the drbd receiver
> thread on the secondary got stuck somewhere.
>
> and if it does not count down (4294967295, 94, 93, 92, etc.),
> but "occasionally" saying "ko = 4294967295",
> then it was not stuck at all, but (very) slowly making progress, still.
>
> see also the ko-count config option.
>
> so maybe there is no bug in drbd at all, in your situation, but "just"
> an overloaded secondary, or maybe a network stack on memory pressure.

Our secondary is a heartbeat failover system.

Linux 2.6.22.19-0.1-xen (geeko@buildhost) (gcc 4.2.1) #1 SMP 2008-10-14 22:17:43 +0200 8CPU [xen02.]

Memory:      Total        Used        Free      Shared     Buffers
Mem:        995328      984856       10472           0      304816
Swap:       995944         264      995680

Bootup: Wed Dec 3 16:20:26 2008    Load average: 0.25 0.17 0.10 1/208 17144

user  :   0:09:32.19   0.0%  page in :  17489402  disk 1:  2895779r 13736405w
nice  :   0:00:06.24   0.0%  page out:  61642457  disk 2:  3426590r 13691503w
system:   2:26:22.54   0.1%  page act:    952661  disk 3:   739460r   216476w
IOwait:   1:01:23.80   0.0%  page dea:    659094
hw irq:   0:00:00.00   0.0%  page flt:  14469038
sw irq:   0:01:48.80   0.0%  swap in :         0
idle  : 101d 12:22:36.01  99.8%  swap out:        70
uptime:  12d 17:01:16.12         context :  407227569

Perhaps it is memory related?

> not "the same thing" as an internal deadlock on some sequence counter
> due to a missing wake_up call because of drbd packets on the meta socket
> overtaking drbd packets on the data socket and forgetting to update that
> sequence counter.

not at all.
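(On the ko-count option mentioned above: it lives in the net section of
drbd.conf. A minimal fragment for illustration; the resource name and the
values are only examples, not a recommendation for this setup:

    resource r1 {
      net {
        timeout   60;   # unit is 0.1 seconds, i.e. 6 seconds
        ko-count  4;    # if the peer fails to complete a single write
                        # request for 4 * timeout, it is considered dead
                        # and the Primary drops the connection instead of
                        # blocking indefinitely; 0 disables the count-down
      }
    }

With ko-count left at 0, the Primary keeps waiting on a slow or stuck
Secondary, which matches the non-decrementing "ko = 4294967295" messages
seen in the log above.)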