Hi Igor,

I was asking about the RAID hardware because I had similar (or at least similar-looking) problems with DRBD 8.0.13, kernel 2.6.18-6 (Debian Etch) and CPU soft lockups under high disk load:

http://www.gossamer-threads.com/lists/drbd/users/16399?search_string=flush;#16399

Unfortunately, I did not have the time to do exhaustive testing (i.e. isolate the problem, reproduce it without DRBD, change specific software versions, etc.), as I was under some pressure to go into production. What I did was try to eliminate as many potential causes as possible, which meant upgrading to a newer kernel version and backporting the RAID controller driver, as well as upgrading to DRBD 8.0.14. I have done some pretty extensive stress testing since then, and the system has been running fine without any problems.

My strongest bet for the cause of the problem is still the RAID controller that was used (LSI 1078), as the driver versions in kernel 2.6.18 are known to cause problems under high load. But of course, I can't be sure. And since I'm now reading that you have similar problems even though you are using a different RAID card, I'm wondering if there could be a problem between DRBD and that specific kernel version. Is it possible for you to upgrade the kernel and do some testing?

Best regards,
Thomas

On 27.01.2009 at 17:47, Igor Neves wrote:

> Hi,
>
> Thomas Reinhold wrote:
>>
>> Hi,
>>
>> what kind of RAID hardware are you using?
>
> I'm using Areca's controller. But I'm pretty sure it's not Areca,
> because we have some more like this in production.
>
>> Regards,
>>
>> Thomas
>>
>> On 26.01.2009 at 18:18, Igor Neves wrote:
>>
>>> Hi,
>>>
>>> Thanks for your help.
>>>
>>> Lars Ellenberg wrote:
>>>>
>>>> On Mon, Jan 26, 2009 at 03:12:02PM +0000, Igor Neves wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm having hard problems with one machine with drbd, and having
>>>>> these kernel panics
>>>>>
>>>> these are NOT kernel panics...
>>>> even though they may look very similar.
>>>>
>>>
>>> Yes, you are right, but they kill my machine like a kernel panic! :)
>>>
>>>>
>>>>> on the machine I run the vmware server.
>>>>>
>>>> thanks for mentioning this.
>>>>
>>>>> Does anyone know what this can be?
>>>>>
>>>>> I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE, this is a
>>>>> CentOS 5.2 machine.
>>>>>
>>>> Not a DRBD problem.
>>>> Appears to be a problem in the redhat kernel on vmware.
>>>> please have a look at
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=463573
>>>>
>>>
>>> Thanks for pointing me to this bug, but I think we are speaking of
>>> different things. That bug mentions a vmware machine as the guest;
>>> here it does not happen on the guest but on the host. The guest is
>>> a Windows machine.
>>>
>>> One more point: I had this vmware setup working on another machine
>>> without problems. Could this be interrupts?
>>>
>>> Here is the interrupts table:
>>>
>>>             CPU0        CPU1        CPU2        CPU3
>>>   0:    85216538    85180125    85220346    85160716   IO-APIC-edge   timer
>>>   1:           8           1           1           0   IO-APIC-edge   i8042
>>>   4:       32854       32895       32997       32828   IO-APIC-edge   serial
>>>   7:           1           1           0           0   IO-APIC-edge   parport0
>>>   8:           0           1           0           0   IO-APIC-edge   rtc
>>>   9:           0           0           1           0   IO-APIC-level  acpi
>>>  50:           0           0           0           0   PCI-MSI        ahci
>>>  58:     1017131     1001386     1008608     1007834   IO-APIC-level  skge
>>>  66:     2995867     2969551     2982197     2975044   IO-APIC-level  eth3
>>>  74:     1431195     1496518     1426694     1506294   IO-APIC-level  eth4
>>>  82:    35769244           0           0           0   PCI-MSI        eth0
>>>  90:    34243658           0           0           0   PCI-MSI        eth1
>>> 233:     2817474     2829933     2839738     2827824   IO-APIC-level  arcmsr
>>> NMI:           0           0           0           0
>>> LOC:   337373327   337372791   336681613   336681080
>>> ERR:           0
>>> MIS:           0
>>>
>>> I wonder whether the skge and r8169 drivers are causing problems
>>> with interrupts and drbd doesn't like it, or even arcmsr, which is
>>> the Areca storage controller driver.
>>>
>>>>
>>>>> Thanks
>>>>>
>>>>> =======================
>>>>> BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
>>>>>
>>>>> Pid: 0, comm: swapper
>>>>> EIP: 0060:[<c0608e1b>] CPU: 0
>>>>> EIP is at _spin_lock_irqsave+0x13/0x27
>>>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
>>>>> EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246
>>>>> ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b
>>>>> CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0
>>>>>  [<f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]
>>>>>  [<f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]
>>>>>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
>>>>>  [<c042ab5a>] __do_softirq+0x5a/0xbb
>>>>>  [<c0407451>] do_softirq+0x52/0x9d
>>>>>  [<c04073f6>] do_IRQ+0xa5/0xae
>>>>>  [<c040592e>] common_interrupt+0x1a/0x20
>>>>>  [<c0403ccf>] mwait_idle+0x25/0x38
>>>>>  [<c0403c90>] cpu_idle+0x9f/0xb9
>>>>>  [<c06eb9ee>] start_kernel+0x379/0x380
>>>>> =======================
>>>>> BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]
>>>>>
>>>>> Pid: 4880, comm: drbd0_receiver
>>>>> EIP: 0060:[<c0608e1b>] CPU: 1
>>>>> EIP is at _spin_lock_irqsave+0x13/0x27
>>>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
>>>>> EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202
>>>>> ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b
>>>>> CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0
>>>>>  [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
>>>>>  [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
>>>>>  [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
>>>>>  [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
>>>>>  [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
>>>>>  [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
>>>>>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
>>>>>  [<c042ab5a>] __do_softirq+0x5a/0xbb
>>>>>  [<c0407451>] do_softirq+0x52/0x9d
>>>>>  [<c042a961>] local_bh_enable+0x74/0x7f
>>>>>  [<c05d1407>] tcp_prequeue_process+0x5a/0x66
>>>>>  [<c05d3692>] tcp_recvmsg+0x416/0x9f7
>>>>>  [<c05a725e>] sock_common_recvmsg+0x2f/0x45
>>>>>  [<c05a5017>] sock_recvmsg+0xe5/0x100
>>>>>  [<c0436347>] autoremove_wake_function+0x0/0x2d
>>>>>  [<c05a6c97>] kernel_sendmsg+0x27/0x35
>>>>>  [<f8d87718>] drbd_send+0x77/0x13f [drbd]
>>>>>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
>>>>>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
>>>>>  [<f8d78694>] drbd_recv_header+0x10/0x94 [drbd]
>>>>>  [<f8d78c4b>] drbdd+0x18/0x12b [drbd]
>>>>>  [<f8d7b586>] drbdd_init+0xa0/0x173 [drbd]
>>>>>  [<f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]
>>>>>  [<f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]
>>>>>  [<c0405c3b>] kernel_thread_helper+0x7/0x10
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
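A side note for readers of the archive: the /proc/interrupts snapshot quoted in the thread already shows that eth0 and eth1 (IRQs 82 and 90, PCI-MSI) are serviced by CPU0 alone, while the other interrupt sources are spread across all four CPUs. Checking a snapshot for such single-CPU IRQs can be done mechanically. A minimal sketch, assuming input in the usual /proc/interrupts layout; the `pinned_irqs` helper is an illustration, not something used in this thread:

```python
# Sketch: flag IRQs whose counts are concentrated on a single CPU,
# given text in the format of /proc/interrupts. Illustrative only.

def pinned_irqs(snapshot: str):
    """Return (irq_label, device) pairs where one CPU handled all interrupts."""
    lines = snapshot.strip().splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    pinned = []
    for line in lines[1:]:
        parts = line.split()
        label = parts[0].rstrip(":")
        counts = []
        for tok in parts[1:1 + ncpus]:     # per-CPU counter columns
            if not tok.isdigit():
                break
            counts.append(int(tok))
        # only rows with a full set of counters (skips NMI/ERR/MIS summary rows)
        if len(counts) == ncpus and sum(counts) > 0:
            busy = [c for c in counts if c > 0]
            if len(busy) == 1:             # exactly one CPU ever serviced it
                device = parts[1 + ncpus + 1] if len(parts) > 1 + ncpus + 1 else ""
                pinned.append((label, device))
    return pinned

# Excerpt of the table quoted in the thread:
sample = """\
           CPU0       CPU1       CPU2       CPU3
 58:    1017131    1001386    1008608    1007834   IO-APIC-level  skge
 82:   35769244          0          0          0   PCI-MSI        eth0
 90:   34243658          0          0          0   PCI-MSI        eth1
233:    2817474    2829933    2839738    2827824   IO-APIC-level  arcmsr
"""

print(pinned_irqs(sample))   # → [('82', 'eth0'), ('90', 'eth1')]
```

On a live system the same function could be fed the current contents of /proc/interrupts; whether a pinned IRQ is actually a problem depends on load, and tools such as irqbalance or the /proc/irq/&lt;n&gt;/smp_affinity mask can be used to spread interrupt handling across CPUs.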