Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wed, Jan 28, 2009 at 10:49:25AM +0000, Igor Neves wrote:
> Hi,
>
> Take a look at this one, this happened during the night, on the other
> node machine, the one that was not running vmware.

sorry, the reference to vmware in my first mail may have been premature.
what I said in my second mail (which you quoted below), commenting on
the backtrace, is independent of vmware. and it is the same in this
backtrace, again.

so, I already gave you all the hints I was able to find in that mail.
follow up on those. check whether the mentioned offending commit is
part of your redhat kernel. try to reproduce with a kernel.org
2.6.18.latest.

I still say it has nothing to do with DRBD, maybe apart from DRBD
changing memory pressure, request size and timing, and using network
and disk at the same time.

> I have also changed the soft lockup threshold to 60 seconds, but it
> did not help.
>
> BUG: soft lockup - CPU#3 stuck for 60s! [drbd0_receiver:8329]
>
> Pid: 8329, comm: drbd0_receiver
> EIP: 0060:[<c0608e1b>] CPU: 3
> EIP is at _spin_lock_irqsave+0x13/0x27
> EFLAGS: 00000282 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
> EAX: f79e0028 EBX: ef110080 ECX: f7c3c000 EDX: 00000202
> ESI: f79e0000 EDI: f7c3c094 EBP: 00000001 DS: 007b ES: 007b
> CR0: 8005003b CR2: 08cd0000 CR3: 37804d40 CR4: 000006f0
> [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
> [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
> [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
> [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
> [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
> [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
> [<c04d8a34>] blk_done_softirq+0x4d/0x58
> [<c042ab5a>] __do_softirq+0x5a/0xbb
> [<c0407451>] do_softirq+0x52/0x9d
> [<c04073f6>] do_IRQ+0xa5/0xae
> [<c040592e>] common_interrupt+0x1a/0x20
> [<c046bf5a>] kfree+0x68/0x6c
> [<c05aa45c>] kfree_skbmem+0x8/0x61
> [<c05d3906>] tcp_recvmsg+0x68a/0x9f7
> [<c0608e47>] _spin_lock_bh+0x8/0x18
> [<c05a725e>] sock_common_recvmsg+0x2f/0x45
> [<c05a5017>] sock_recvmsg+0xe5/0x100
> [<c0436347>] autoremove_wake_function+0x0/0x2d
> [<c05a725e>] sock_common_recvmsg+0x2f/0x45
> [<c0436347>] autoremove_wake_function+0x0/0x2d
> [<c0455fb0>] mempool_alloc+0x28/0xc9
> [<c04750de>] bio_add_page+0x25/0x2b
> [<f8b83485>] drbd_recv+0x57/0xd7 [drbd]
> [<f8b8536a>] read_in_block+0x7f/0xff [drbd]
> [<f8b87dde>] receive_Data+0x135/0x9a2 [drbd]
> [<f8b83485>] drbd_recv+0x57/0xd7 [drbd]
> [<f8b83c95>] drbdd+0x62/0x12b [drbd]
> [<f8b86586>] drbdd_init+0xa0/0x173 [drbd]
> [<f8b94e47>] drbd_thread_setup+0xbb/0x150 [drbd]
> [<f8b94d8c>] drbd_thread_setup+0x0/0x150 [drbd]
> [<c0405c3b>] kernel_thread_helper+0x7/0x10
>
> Lars Ellenberg wrote:
> > On Mon, Jan 26, 2009 at 05:18:11PM +0000, Igor Neves wrote:
> >
> >>> thanks for mentioning this.
> >>
> >>> Does anyone know what this can be?
> >>>
> >>> I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE, this is a
> >>> Centos 5.2 machine.
> >>
> >>> Not a DRBD problem.
> >>> Appears to be a problem in the redhat kernel on vmware.
> >>> please have a look at
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=463573
> >>
> >> Thanks for pointing me to this bug, but I think we are speaking of
> >> different things. That bug mentions a vmware machine as the guest;
> >> this does not happen on the guest but on the host. The guest is a
> >> Windows machine.
> >>
> >> One more point: I had this vmware setup working on another machine
> >> without problems. Can this be interrupts?
> >
> > it definitely has something to do with interrupts,
> > as the stack trace you provided hints at a
> > spinlock deadlock in bottom half context,
> > or a livelock within the generic scsi layer.
> >
> >> Here is the interrupts table:
> >>
> >>            CPU0        CPU1        CPU2        CPU3
> >>   0:   85216538    85180125    85220346    85160716   IO-APIC-edge   timer
> >>   1:          8           1           1           0   IO-APIC-edge   i8042
> >>   4:      32854       32895       32997       32828   IO-APIC-edge   serial
> >>   7:          1           1           0           0   IO-APIC-edge   parport0
> >>   8:          0           1           0           0   IO-APIC-edge   rtc
> >>   9:          0           0           1           0   IO-APIC-level  acpi
> >>  50:          0           0           0           0   PCI-MSI        ahci
> >>  58:    1017131     1001386     1008608     1007834   IO-APIC-level  skge
> >>  66:    2995867     2969551     2982197     2975044   IO-APIC-level  eth3
> >>  74:    1431195     1496518     1426694     1506294   IO-APIC-level  eth4
> >>  82:   35769244           0           0           0   PCI-MSI        eth0
> >>  90:   34243658           0           0           0   PCI-MSI        eth1
> >> 233:    2817474     2829933     2839738     2827824   IO-APIC-level  arcmsr
> >> NMI:          0           0           0           0
> >> LOC:  337373327   337372791   336681613   336681080
> >> ERR:          0
> >> MIS:          0
> >>
> >> I wonder if the skge and r8169 drivers are causing problems with
> >> interrupts and drbd doesn't like it, or maybe arcmsr, the areca
> >> controller storage driver.
> >
> > I doubt that it has anything to do with drbd (other than drbd causing
> > disk and network load at the same time, and maybe altering timing and
> > latencies).
> >
> > I'll try to reason that by commenting on the stack trace below.
> >
> >>>> BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
> >>>>
> >>>> Pid: 0, comm: swapper
> >>>> EIP: 0060:[<c0608e1b>] CPU: 0
> >>>> EIP is at _spin_lock_irqsave+0x13/0x27
> >>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
> >>>> EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246
> >>>> ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b
> >>>> CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0
> >>>> [<f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]
> >>>> [<f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]
> >>>> [<c04d8a34>] blk_done_softirq+0x4d/0x58
> >>>> [<c042ab5a>] __do_softirq+0x5a/0xbb
> >>>> [<c0407451>] do_softirq+0x52/0x9d
> >>>> [<c04073f6>] do_IRQ+0xa5/0xae
> >>>> [<c040592e>] common_interrupt+0x1a/0x20
> >>>> [<c0403ccf>] mwait_idle+0x25/0x38
> >>>> [<c0403c90>] cpu_idle+0x9f/0xb9
> >>>> [<c06eb9ee>] start_kernel+0x379/0x380
> >
> > see, no mention of drbd in there.
> > something tries to get a spinlock in softirq context,
> > but cannot, probably because it is held by someone in process context,
> > and that someone did not disable interrupts where it should have done,
> > or something re-enabled interrupts early.
> > or because of the below mentioned livelock on the other cpu.
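to make that more concrete: this is the classic locking mistake that
produces exactly such a pair of traces. a minimal, made-up sketch --
NOT the actual scsi code, just the generic pattern:

/* hypothetical illustration only, not from scsi_lib.c or any driver */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(some_lock);

/* process context: buggy, takes the lock but leaves interrupts enabled */
static void buggy_process_context_path(void)
{
	spin_lock(&some_lock);	/* should be spin_lock_irqsave() */
	/*
	 * if an interrupt fires right here on this cpu, and its softirq
	 * wants some_lock, that softirq spins forever: we cannot run
	 * again to release the lock, because we have been interrupted.
	 */
	spin_unlock(&some_lock);
}

/* softirq context, e.g. a block/scsi completion handler */
static void completion_softirq_path(void)
{
	unsigned long flags;

	spin_lock_irqsave(&some_lock, flags);	/* may spin forever, see above */
	/* ... */
	spin_unlock_irqrestore(&some_lock, flags);
}

the fix for such a pattern is to use spin_lock_irqsave() (or at least
spin_lock_bh()) in every process context path as well, so the
completion can never interrupt the lock holder on the same cpu.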
> >>>> =======================
> >>>> BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]
> >>>>
> >>>> Pid: 4880, comm: drbd0_receiver
> >>>> EIP: 0060:[<c0608e1b>] CPU: 1
> >>>> EIP is at _spin_lock_irqsave+0x13/0x27
> >>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
> >>>> EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202
> >>>> ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b
> >>>> CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0
> >>>> [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
> >>>> [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
> >>>> [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
> >>>> [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
> >>>> [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
> >>>> [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
> >>>> [<c04d8a34>] blk_done_softirq+0x4d/0x58
> >>>> [<c042ab5a>] __do_softirq+0x5a/0xbb
> >>>> [<c0407451>] do_softirq+0x52/0x9d
> >>>> [<c042a961>] local_bh_enable+0x74/0x7f
> >
> > the stack trace above is a soft irq,
> > which interrupted the currently active process on this cpu,
> > which happened to be the drbd receiver.
> > but it has nothing directly to do with drbd
> > (other than that it is likely that DRBD submitted the request
> > in the first place).
> >
> > some scsi thing finished, triggered a softirq,
> > and within ./drivers/scsi/scsi_lib.c:scsi_run_queue()
> > it either blocks on spin_lock_irqsave(shost->host_lock, flags),
> > or livelocks within that while loop there,
> > possibly while the other cpu (see above)
> > sits on the same spin_lock within
> > ./drivers/scsi/scsi_lib.c:scsi_device_unbusy(sdev);
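from memory, the loop in question looks roughly like this -- a
simplified paraphrase of the 2.6.18 code, declarations and several
conditions omitted, so read the actual scsi_lib.c of your kernel:

/* simplified paraphrase of drivers/scsi/scsi_lib.c:scsi_run_queue() */
spin_lock_irqsave(shost->host_lock, flags);
while (!list_empty(&shost->starved_list)) {
	/* pick the first starved device and run its queue */
	sdev = list_entry(shost->starved_list.next,
			  struct scsi_device, starved_entry);
	list_del_init(&sdev->starved_entry);

	spin_unlock_irqrestore(shost->host_lock, flags);
	blk_run_queue(sdev->request_queue);
	spin_lock_irqsave(shost->host_lock, flags);

	/*
	 * if running the queue puts the same device straight back onto
	 * starved_list, this loop never terminates: a livelock in
	 * softirq context, taking and retaking host_lock forever,
	 * while the other cpu (trace above) spins on the very same
	 * host_lock in scsi_device_unbusy().
	 */
}
spin_unlock_irqrestore(shost->host_lock, flags);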
> > this looks more like a problem in the generic scsi layer.
> > maybe it even has been fixed upstream;
> > http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9
> > sounds a bit similar (just skip forward to that comment 9,
> > the initial reports/logs/comments are not interesting).
> > maybe the redhat kernel has a similar problem,
> > by "backporting" the upstream commit
> > f0c0a376d0fcd4c5579ecf5e95f88387cba85211,
> > which according to the above mentioned bug broke it...
> >
> > otherwise, stress the box without DRBD.
> > try a different kernel (as you are on 2.6.18-RHEL,
> > start with a "kernel.org" 2.6.18).
> > do a kernel bisect if you find that some kernels show
> > this behaviour and others do not.
> >
> > again, this has nothing to do with DRBD.
> >
> > the stack trace below is DRBD,
> > going into the tcp stack, waiting for something to receive.
> > nothing unusual there.
> >
> >>>> [<c05d1407>] tcp_prequeue_process+0x5a/0x66
> >>>> [<c05d3692>] tcp_recvmsg+0x416/0x9f7
> >>>> [<c05a725e>] sock_common_recvmsg+0x2f/0x45
> >>>> [<c05a5017>] sock_recvmsg+0xe5/0x100
> >>>> [<c0436347>] autoremove_wake_function+0x0/0x2d
> >>>> [<c05a6c97>] kernel_sendmsg+0x27/0x35
> >>>> [<f8d87718>] drbd_send+0x77/0x13f [drbd]
> >>>> [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
> >>>> [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
> >>>> [<f8d78694>] drbd_recv_header+0x10/0x94 [drbd]
> >>>> [<f8d78c4b>] drbdd+0x18/0x12b [drbd]
> >>>> [<f8d7b586>] drbdd_init+0xa0/0x173 [drbd]
> >>>> [<f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]
> >>>> [<f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]
> >>>> [<c0405c3b>] kernel_thread_helper+0x7/0x10
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

__
please don't Cc me, but send to list -- I'm subscribed