Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

Thanks so much for your help.

Lars Ellenberg wrote:
> On Mon, Jan 26, 2009 at 05:18:11PM +0000, Igor Neves wrote:
>>> thanks for mentioning this.
>>
>>> Does anyone know what this can be?
>>>
>>> I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE, this is a CentOS
>>> 5.2 machine.
>>>
>>> Not a DRBD problem.
>>> Appears to be a problem in the redhat kernel on vmware.
>>> please have a look at
>>> https://bugzilla.redhat.com/show_bug.cgi?id=463573
>>
>> Thanks for pointing me to this bug, but I think we are speaking of
>> different things. That bug mentions a vmware machine as the guest; this
>> does not happen on the guest but on the host. The guest is a Windows
>> machine.
>>
>> One more point: I had this vmware setup working on another machine
>> without problems. Can this be interrupts?
>
> it definitely has something to do with interrupts,
> as the stack trace you provided hints at a
> spinlock deadlock in bottom half context,
> or a livelock within the generic scsi layer.
>
>> Here is the interrupts table:
>>
>>             CPU0        CPU1        CPU2        CPU3
>>    0:   85216538    85180125    85220346    85160716   IO-APIC-edge   timer
>>    1:          8           1           1           0   IO-APIC-edge   i8042
>>    4:      32854       32895       32997       32828   IO-APIC-edge   serial
>>    7:          1           1           0           0   IO-APIC-edge   parport0
>>    8:          0           1           0           0   IO-APIC-edge   rtc
>>    9:          0           0           1           0   IO-APIC-level  acpi
>>   50:          0           0           0           0   PCI-MSI        ahci
>>   58:    1017131     1001386     1008608     1007834   IO-APIC-level  skge
>>   66:    2995867     2969551     2982197     2975044   IO-APIC-level  eth3
>>   74:    1431195     1496518     1426694     1506294   IO-APIC-level  eth4
>>   82:   35769244           0           0           0   PCI-MSI        eth0
>>   90:   34243658           0           0           0   PCI-MSI        eth1
>>  233:    2817474     2829933     2839738     2827824   IO-APIC-level  arcmsr
>>  NMI:          0           0           0           0
>>  LOC:  337373327   337372791   336681613   336681080
>>  ERR:          0
>>  MIS:          0
>>
>> I wonder if the skge and r8169 drivers are causing problems with
>> interrupts that drbd doesn't like, or even arcmsr, which is the Areca
>> storage controller driver.
>
> I doubt that it has anything to do with drbd (other than drbd causing
> disk and network load at the same time, and maybe altering timing and
> latencies).
>
> I'll try to reason that by commenting the stack trace below.
>
>>>> BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
>>>>
>>>> Pid: 0, comm: swapper
>>>> EIP: 0060:[<c0608e1b>] CPU: 0
>>>> EIP is at _spin_lock_irqsave+0x13/0x27
>>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
>>>> EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246
>>>> ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b
>>>> CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0
>>>>  [<f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]
>>>>  [<f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]
>>>>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
>>>>  [<c042ab5a>] __do_softirq+0x5a/0xbb
>>>>  [<c0407451>] do_softirq+0x52/0x9d
>>>>  [<c04073f6>] do_IRQ+0xa5/0xae
>>>>  [<c040592e>] common_interrupt+0x1a/0x20
>>>>  [<c0403ccf>] mwait_idle+0x25/0x38
>>>>  [<c0403c90>] cpu_idle+0x9f/0xb9
>>>>  [<c06eb9ee>] start_kernel+0x379/0x380
>
> see, no mention of drbd in there.
> something tries to get a spinlock in softirq context,
> but cannot, probably because it is held by someone in process context,
> and that someone did not disable interrupts where it should have done,
> or something re-enabled interrupts early.
> or because of the below mentioned livelock on the other cpu.
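[Editor's note: to picture the failure mode Lars describes, here is a minimal
sketch of the broken locking pattern -- a lock shared between process context
and softirq context that is taken without disabling interrupts on the process
side. The names are generic illustrations, not the actual scsi code:]

#include <linux/spinlock.h>

/* a lock shared between process context and a softirq handler */
static DEFINE_SPINLOCK(shared_lock);

static void process_context_path(void)
{
        spin_lock(&shared_lock);        /* BUG: should be spin_lock_irqsave()
                                         * (or at least spin_lock_bh()) */
        /* ... if an interrupt arrives here and its softirq runs on this
         * same CPU, the softirq below spins forever ... */
        spin_unlock(&shared_lock);
}

static void softirq_path(void)          /* invoked via __do_softirq() */
{
        unsigned long flags;

        /* never acquired: the holder is the very task this softirq
         * interrupted, and it cannot run again until we return */
        spin_lock_irqsave(&shared_lock, flags);
        spin_unlock_irqrestore(&shared_lock, flags);
}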
>>>> =======================
>>>> BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]
>>>>
>>>> Pid: 4880, comm: drbd0_receiver
>>>> EIP: 0060:[<c0608e1b>] CPU: 1
>>>> EIP is at _spin_lock_irqsave+0x13/0x27
>>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
>>>> EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202
>>>> ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b
>>>> CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0
>>>>  [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
>>>>  [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
>>>>  [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
>>>>  [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
>>>>  [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
>>>>  [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
>>>>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
>>>>  [<c042ab5a>] __do_softirq+0x5a/0xbb
>>>>  [<c0407451>] do_softirq+0x52/0x9d
>>>>  [<c042a961>] local_bh_enable+0x74/0x7f
>
> the stack trace above is a soft irq,
> which interrupted the currently active process on this cpu,
> which happened to be the drbd receiver.
> but it has nothing directly to do with drbd
> (other than it is likely that DRBD submitted the request
> in the first place).
>
> some scsi thing finished, triggered a softirq,
> and within ./drivers/scsi/scsi_lib.c:scsi_run_queue()
> it either blocks on spin_lock_irqsave(shost->host_lock, flags),
> or livelocks within that while loop there,
> possibly while the other cpu (see above)
> sits on the same spin_lock within
> ./drivers/scsi/scsi_lib.c:scsi_device_unbusy(sdev);
>
> this looks more like a problem in the generic scsi layer.
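[Editor's note: for a concrete picture, here is the rough shape of the two
contending paths, paraphrased from Lars's description -- an illustrative
sketch, not the literal 2.6.18 drivers/scsi/scsi_lib.c source:]

#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/* CPU#0 (first trace): blocks here while CPU#1 holds the lock. */
static void scsi_device_unbusy_sketch(struct scsi_device *sdev)
{
        struct Scsi_Host *shost = sdev->host;
        unsigned long flags;

        spin_lock_irqsave(shost->host_lock, flags);     /* <-- CPU#0 spins */
        shost->host_busy--;
        spin_unlock_irqrestore(shost->host_lock, flags);
}

/* CPU#1 (second trace): loops over starved devices under the same lock;
 * if devices keep re-entering the starved list, the loop never
 * terminates (livelock) and CPU#0 never gets the lock. */
static void scsi_run_queue_sketch(struct Scsi_Host *shost)
{
        unsigned long flags;

        spin_lock_irqsave(shost->host_lock, flags);
        while (!list_empty(&shost->starved_list)) {
                /* pick a starved device, drop the lock,
                 * kick its queue, then re-take the lock */
                spin_unlock_irqrestore(shost->host_lock, flags);
                /* ... run the device's request queue ... */
                spin_lock_irqsave(shost->host_lock, flags);
        }
        spin_unlock_irqrestore(shost->host_lock, flags);
}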
> maybe it has even been fixed upstream;
> http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9
> sounds a bit similar (just skip forward to that comment 9,
> the initial reports/logs/comments are not interesting).
> maybe the redhat kernel has a similar problem,
> by "backporting" the upstream commit
> f0c0a376d0fcd4c5579ecf5e95f88387cba85211,
> which according to the above mentioned bug broke it...
>
> otherwise, stress the box without DRBD.

I have done this, without any problems. I don't know exactly when this
happens, but normally it's when we generate some I/O, copying files to the
disk. But it only happens when doing it over drbd, and after some time
(1-2 days...). If I do it on the same controller and over the same disks,
without drbd in the way, I have no problems at all. And we have these
controllers working fine everywhere, as we do drbd. So I must insist this
has something to do with the NICs drbd uses to replicate while it's
writing. Can this be true? Or am I terribly mistaken?

> try a different kernel (as you are with 2.6.18-RHEL,
> start with a "kernel.org" 2.6.18).
> do a kernel bisect, if you find
> some kernels show this behaviour, and others do not.
>
> again, this has nothing to do with DRBD.
>
> the stack trace below is DRBD,
> going into the tcp stack, waiting for something to receive.
> nothing unusual there.

Does this stack trace stop when it is waiting to receive from the network
(NIC)?

>>>>  [<c05d1407>] tcp_prequeue_process+0x5a/0x66
>>>>  [<c05d3692>] tcp_recvmsg+0x416/0x9f7
>>>>  [<c05a725e>] sock_common_recvmsg+0x2f/0x45
>>>>  [<c05a5017>] sock_recvmsg+0xe5/0x100
>>>>  [<c0436347>] autoremove_wake_function+0x0/0x2d
>>>>  [<c05a6c97>] kernel_sendmsg+0x27/0x35
>>>>  [<f8d87718>] drbd_send+0x77/0x13f [drbd]
>>>>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
>>>>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
>>>>  [<f8d78694>] drbd_recv_header+0x10/0x94 [drbd]
>>>>  [<f8d78c4b>] drbdd+0x18/0x12b [drbd]
>>>>  [<f8d7b586>] drbdd_init+0xa0/0x173 [drbd]
>>>>  [<f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]
>>>>  [<f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]
>>>>  [<c0405c3b>] kernel_thread_helper+0x7/0x10
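[Editor's note: the trace above is consistent with a thread simply sleeping
in tcp_recvmsg() until data arrives, which is the normal idle state of a
blocking in-kernel receive. A rough generic sketch of such a receive --
not DRBD's actual drbd_recv() -- follows:]

#include <linux/net.h>
#include <linux/socket.h>

/* Receive exactly 'size' bytes from an in-kernel socket, blocking
 * until they arrive. kernel_recvmsg() ends up in tcp_recvmsg(), where
 * the calling thread sleeps until the data is available, the peer
 * closes the connection, or a signal is delivered -- the state shown
 * in the stack trace above. */
static int recv_exact(struct socket *sock, void *buf, size_t size)
{
        struct kvec iov = { .iov_base = buf, .iov_len = size };
        struct msghdr msg = { .msg_flags = MSG_WAITALL | MSG_NOSIGNAL };

        return kernel_recvmsg(sock, &msg, &iov, 1, size, msg.msg_flags);
}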