Hi Igor,

I was asking about the RAID hardware because I had similar (or at least similar-looking) problems with DRBD 8.0.13, kernel 2.6.18-6 (Debian Etch) and CPU soft lockups under high disk load:

http://www.gossamer-threads.com/lists/drbd/users/16399?search_string=flush;#16399

Unfortunately, I did not have the time to do exhaustive testing (i.e. isolate the problem, reproduce it without DRBD, change specific software versions, etc.), as I was under some pressure to go into production. What I did was try to eliminate as many potential causes as possible, which meant upgrading to a newer kernel version and backporting the RAID controller driver, as well as upgrading to DRBD 8.0.14. I have done some pretty extensive stress testing since then, and the system has been running fine without any problems.

My strongest bet for the cause of the problem is still the RAID controller that was used (LSI 1078), as the driver versions in kernel 2.6.18 are known to cause problems under high load. But of course, I can't be sure. And since I'm now reading that you have similar problems even though you are using a different RAID card, I'm wondering if there could be a problem between DRBD and that specific kernel version. Is it possible for you to upgrade the kernel and do some testing?

Best regards,
Thomas

On 27.01.2009 at 17:47, Igor Neves wrote:

> Hi,
>
> Thomas Reinhold wrote:
>>
>> Hi,
>>
>> what kind of RAID hardware are you using?
>
> I'm using Areca's controller. But I'm pretty sure it's not Areca,
> because we have some more like this in production.
>
>> Regards,
>>
>> Thomas
>>
>> On 26.01.2009 at 18:18, Igor Neves wrote:
>>
>>> Hi,
>>>
>>> Thanks for your help.
>>>
>>> Lars Ellenberg wrote:
>>>>
>>>> On Mon, Jan 26, 2009 at 03:12:02PM +0000, Igor Neves wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm having hard problems with one machine with drbd, and having
>>>>> these kernel panics
>>>>>
>>>> these are NOT kernel panics...
>>>> even though they may look very similar.
>>>>
>>>
>>> Yes, you are right, but they kill my machine like a kernel panic! :)
>>>
>>>>
>>>>> on the machine I run the vmware server.
>>>>>
>>>> thanks for mentioning this.
>>>>
>>>>> Does anyone know what this can be?
>>>>>
>>>>> I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE, this is a
>>>>> CentOS 5.2 machine.
>>>>>
>>>> Not a DRBD problem.
>>>> Appears to be a problem in the redhat kernel on vmware.
>>>> please have a look at
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=463573
>>>>
>>>
>>> Thanks for pointing me to this bug, but I think we are speaking of
>>> different things. That bug mentions a vmware machine as the guest;
>>> here it does not happen on the guest but on the host. The guest is
>>> a Windows machine.
>>>
>>> One more point: I had this vmware setup working on another machine
>>> without problems. Could this be interrupts?
>>>
>>> Here is the interrupts table:
>>>
>>>             CPU0        CPU1        CPU2        CPU3
>>>   0:    85216538    85180125    85220346    85160716   IO-APIC-edge   timer
>>>   1:           8           1           1           0   IO-APIC-edge   i8042
>>>   4:       32854       32895       32997       32828   IO-APIC-edge   serial
>>>   7:           1           1           0           0   IO-APIC-edge   parport0
>>>   8:           0           1           0           0   IO-APIC-edge   rtc
>>>   9:           0           0           1           0   IO-APIC-level  acpi
>>>  50:           0           0           0           0   PCI-MSI        ahci
>>>  58:     1017131     1001386     1008608     1007834   IO-APIC-level  skge
>>>  66:     2995867     2969551     2982197     2975044   IO-APIC-level  eth3
>>>  74:     1431195     1496518     1426694     1506294   IO-APIC-level  eth4
>>>  82:    35769244           0           0           0   PCI-MSI        eth0
>>>  90:    34243658           0           0           0   PCI-MSI        eth1
>>> 233:     2817474     2829933     2839738     2827824   IO-APIC-level  arcmsr
>>> NMI:           0           0           0           0
>>> LOC:   337373327   337372791   336681613   336681080
>>> ERR:           0
>>> MIS:           0
>>>
>>> I wonder whether the skge and r8169 drivers are causing problems
>>> with interrupts and drbd doesn't like it, or even arcmsr, which is
>>> the Areca storage controller driver.
>>>
>>>>
>>>>> Thanks
>>>>>
>>>>> =======================
>>>>> BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
>>>>>
>>>>> Pid: 0, comm: swapper
>>>>> EIP: 0060:[<c0608e1b>] CPU: 0
>>>>> EIP is at _spin_lock_irqsave+0x13/0x27
>>>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
>>>>> EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246
>>>>> ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b
>>>>> CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0
>>>>>  [<f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]
>>>>>  [<f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]
>>>>>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
>>>>>  [<c042ab5a>] __do_softirq+0x5a/0xbb
>>>>>  [<c0407451>] do_softirq+0x52/0x9d
>>>>>  [<c04073f6>] do_IRQ+0xa5/0xae
>>>>>  [<c040592e>] common_interrupt+0x1a/0x20
>>>>>  [<c0403ccf>] mwait_idle+0x25/0x38
>>>>>  [<c0403c90>] cpu_idle+0x9f/0xb9
>>>>>  [<c06eb9ee>] start_kernel+0x379/0x380
>>>>> =======================
>>>>> BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]
>>>>>
>>>>> Pid: 4880, comm: drbd0_receiver
>>>>> EIP: 0060:[<c0608e1b>] CPU: 1
>>>>> EIP is at _spin_lock_irqsave+0x13/0x27
>>>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
>>>>> EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202
>>>>> ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b
>>>>> CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0
>>>>>  [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
>>>>>  [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
>>>>>  [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
>>>>>  [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
>>>>>  [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
>>>>>  [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
>>>>>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
>>>>>  [<c042ab5a>] __do_softirq+0x5a/0xbb
>>>>>  [<c0407451>] do_softirq+0x52/0x9d
>>>>>  [<c042a961>] local_bh_enable+0x74/0x7f
>>>>>  [<c05d1407>] tcp_prequeue_process+0x5a/0x66
>>>>>  [<c05d3692>] tcp_recvmsg+0x416/0x9f7
>>>>>  [<c05a725e>] sock_common_recvmsg+0x2f/0x45
>>>>>  [<c05a5017>] sock_recvmsg+0xe5/0x100
>>>>>  [<c0436347>] autoremove_wake_function+0x0/0x2d
>>>>>  [<c05a6c97>] kernel_sendmsg+0x27/0x35
>>>>>  [<f8d87718>] drbd_send+0x77/0x13f [drbd]
>>>>>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
>>>>>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
>>>>>  [<f8d78694>] drbd_recv_header+0x10/0x94 [drbd]
>>>>>  [<f8d78c4b>] drbdd+0x18/0x12b [drbd]
>>>>>  [<f8d7b586>] drbdd_init+0xa0/0x173 [drbd]
>>>>>  [<f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]
>>>>>  [<f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]
>>>>>  [<c0405c3b>] kernel_thread_helper+0x7/0x10
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
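A side note for readers of the archive: the /proc/interrupts snapshot quoted in the thread already shows that eth0 and eth1 (IRQs 82 and 90, PCI-MSI) are serviced by CPU0 alone, while the other interrupt sources are spread across all four CPUs. Checking a snapshot for such single-CPU IRQs can be done mechanically. A minimal sketch, assuming input in the usual /proc/interrupts layout; the `pinned_irqs` helper is an illustration, not something used in this thread:

```python
# Sketch: flag IRQs whose counts are concentrated on a single CPU,
# given text in the format of /proc/interrupts. Illustrative only.

def pinned_irqs(snapshot: str):
    """Return (irq_label, device) pairs where one CPU handled all interrupts."""
    lines = snapshot.strip().splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    pinned = []
    for line in lines[1:]:
        parts = line.split()
        label = parts[0].rstrip(":")
        counts = []
        for tok in parts[1:1 + ncpus]:     # per-CPU counter columns
            if not tok.isdigit():
                break
            counts.append(int(tok))
        # only rows with a full set of counters (skips NMI/ERR/MIS summary rows)
        if len(counts) == ncpus and sum(counts) > 0:
            busy = [c for c in counts if c > 0]
            if len(busy) == 1:             # exactly one CPU ever serviced it
                device = parts[1 + ncpus + 1] if len(parts) > 1 + ncpus + 1 else ""
                pinned.append((label, device))
    return pinned

# Excerpt of the table quoted in the thread:
sample = """\
           CPU0       CPU1       CPU2       CPU3
 58:    1017131    1001386    1008608    1007834   IO-APIC-level  skge
 82:   35769244          0          0          0   PCI-MSI        eth0
 90:   34243658          0          0          0   PCI-MSI        eth1
233:    2817474    2829933    2839738    2827824   IO-APIC-level  arcmsr
"""

print(pinned_irqs(sample))   # → [('82', 'eth0'), ('90', 'eth1')]
```

On a live system the same function could be fed the current contents of /proc/interrupts; whether a pinned IRQ is actually a problem depends on load, and tools such as irqbalance or the /proc/irq/&lt;n&gt;/smp_affinity mask can be used to spread interrupt handling across CPUs.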