Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wed, Jan 28, 2009 at 10:49:25AM +0000, Igor Neves wrote:
> Hi,
>
> Take a look at this one, this happened during the night, on the other
> node machine, the one that was not running vmware.

sorry, the reference to vmware in my first mail may have been premature.
what I said in my second mail (which you quoted below), commenting on
the backtrace, is independent of vmware. and it is the same in this
backtrace, again.

so, I already gave you all the hints I was able to find in that mail.
follow up on those. check whether the mentioned offending commit is
part of your redhat kernel. try to reproduce with a kernel.org
2.6.18.latest.

I still say it has nothing to do with DRBD, maybe apart from DRBD
changing memory pressure, request size and timing, and using network
and disk at the same time.

> I have also changed the soft lockup threshold to 60 seconds, but it
> did not help.
>
> BUG: soft lockup - CPU#3 stuck for 60s! [drbd0_receiver:8329]
>
> Pid: 8329, comm: drbd0_receiver
> EIP: 0060:[<c0608e1b>] CPU: 3
> EIP is at _spin_lock_irqsave+0x13/0x27
> EFLAGS: 00000282 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
> EAX: f79e0028 EBX: ef110080 ECX: f7c3c000 EDX: 00000202
> ESI: f79e0000 EDI: f7c3c094 EBP: 00000001 DS: 007b ES: 007b
> CR0: 8005003b CR2: 08cd0000 CR3: 37804d40 CR4: 000006f0
> [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
> [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
> [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
> [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
> [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
> [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
> [<c04d8a34>] blk_done_softirq+0x4d/0x58
> [<c042ab5a>] __do_softirq+0x5a/0xbb
> [<c0407451>] do_softirq+0x52/0x9d
> [<c04073f6>] do_IRQ+0xa5/0xae
> [<c040592e>] common_interrupt+0x1a/0x20
> [<c046bf5a>] kfree+0x68/0x6c
> [<c05aa45c>] kfree_skbmem+0x8/0x61
> [<c05d3906>] tcp_recvmsg+0x68a/0x9f7
> [<c0608e47>] _spin_lock_bh+0x8/0x18
> [<c05a725e>] sock_common_recvmsg+0x2f/0x45
> [<c05a5017>] sock_recvmsg+0xe5/0x100
> [<c0436347>] autoremove_wake_function+0x0/0x2d
> [<c05a725e>] sock_common_recvmsg+0x2f/0x45
> [<c0436347>] autoremove_wake_function+0x0/0x2d
> [<c0455fb0>] mempool_alloc+0x28/0xc9
> [<c04750de>] bio_add_page+0x25/0x2b
> [<f8b83485>] drbd_recv+0x57/0xd7 [drbd]
> [<f8b8536a>] read_in_block+0x7f/0xff [drbd]
> [<f8b87dde>] receive_Data+0x135/0x9a2 [drbd]
> [<f8b83485>] drbd_recv+0x57/0xd7 [drbd]
> [<f8b83c95>] drbdd+0x62/0x12b [drbd]
> [<f8b86586>] drbdd_init+0xa0/0x173 [drbd]
> [<f8b94e47>] drbd_thread_setup+0xbb/0x150 [drbd]
> [<f8b94d8c>] drbd_thread_setup+0x0/0x150 [drbd]
> [<c0405c3b>] kernel_thread_helper+0x7/0x10
>
> Lars Ellenberg wrote:
> > On Mon, Jan 26, 2009 at 05:18:11PM +0000, Igor Neves wrote:
> >
> >>> thanks for mentioning this.
> >>
> >>> Does anyone know what this can be?
> >>>
> >>> I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE, this is a
> >>> Centos 5.2 machine.
> >>
> >>> Not a DRBD problem.
> >>> Appears to be a problem in the redhat kernel on vmware.
> >>> please have a look at
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=463573
> >>
> >> Thanks for pointing me to this bug, but I think we are speaking of
> >> different things. That bug mentions a vmware machine as the guest;
> >> this does not happen on the guest but on the host. The guest is a
> >> Windows machine.
> >>
> >> One more point: I had this vmware setup working on another machine
> >> without problems. Can this be interrupts?
> >
> > it definitely has something to do with interrupts,
> > as the stack trace you provided hints at a
> > spinlock deadlock in bottom half context,
> > or a livelock within the generic scsi layer.
> >
> >> Here is the interrupts table:
> >>
> >>            CPU0        CPU1        CPU2        CPU3
> >>   0:   85216538    85180125    85220346    85160716   IO-APIC-edge   timer
> >>   1:          8           1           1           0   IO-APIC-edge   i8042
> >>   4:      32854       32895       32997       32828   IO-APIC-edge   serial
> >>   7:          1           1           0           0   IO-APIC-edge   parport0
> >>   8:          0           1           0           0   IO-APIC-edge   rtc
> >>   9:          0           0           1           0   IO-APIC-level  acpi
> >>  50:          0           0           0           0   PCI-MSI        ahci
> >>  58:    1017131     1001386     1008608     1007834   IO-APIC-level  skge
> >>  66:    2995867     2969551     2982197     2975044   IO-APIC-level  eth3
> >>  74:    1431195     1496518     1426694     1506294   IO-APIC-level  eth4
> >>  82:   35769244           0           0           0   PCI-MSI        eth0
> >>  90:   34243658           0           0           0   PCI-MSI        eth1
> >> 233:    2817474     2829933     2839738     2827824   IO-APIC-level  arcmsr
> >> NMI:          0           0           0           0
> >> LOC:  337373327   337372791   336681613   336681080
> >> ERR:          0
> >> MIS:          0
> >>
> >> I wonder if the skge and r8169 drivers are causing problems with
> >> interrupts and drbd doesn't like it, or maybe arcmsr, the areca
> >> controller storage driver.
> >
> > I doubt that it has anything to do with drbd (other than drbd causing
> > disk and network load at the same time, and maybe altering timing and
> > latencies).
> >
> > I'll try to reason that by commenting on the stack trace below.
> >
> >>>> BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
> >>>>
> >>>> Pid: 0, comm: swapper
> >>>> EIP: 0060:[<c0608e1b>] CPU: 0
> >>>> EIP is at _spin_lock_irqsave+0x13/0x27
> >>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
> >>>> EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246
> >>>> ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b
> >>>> CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0
> >>>> [<f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]
> >>>> [<f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]
> >>>> [<c04d8a34>] blk_done_softirq+0x4d/0x58
> >>>> [<c042ab5a>] __do_softirq+0x5a/0xbb
> >>>> [<c0407451>] do_softirq+0x52/0x9d
> >>>> [<c04073f6>] do_IRQ+0xa5/0xae
> >>>> [<c040592e>] common_interrupt+0x1a/0x20
> >>>> [<c0403ccf>] mwait_idle+0x25/0x38
> >>>> [<c0403c90>] cpu_idle+0x9f/0xb9
> >>>> [<c06eb9ee>] start_kernel+0x379/0x380
> >
> > see, no mention of drbd in there.
> > something tries to get a spinlock in softirq context,
> > but cannot, probably because it is held by someone in process context,
> > and that someone did not disable interrupts where it should have done,
> > or something re-enabled interrupts early.
> > or because of the below mentioned livelock on the other cpu.
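to make that more concrete: this is the classic locking mistake that
produces exactly such a pair of traces. a minimal, made-up sketch --
NOT the actual scsi code, just the generic pattern:

/* hypothetical illustration only, not from scsi_lib.c or any driver */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(some_lock);

/* process context: buggy, takes the lock but leaves interrupts enabled */
static void buggy_process_context_path(void)
{
	spin_lock(&some_lock);	/* should be spin_lock_irqsave() */
	/*
	 * if an interrupt fires right here on this cpu, and its softirq
	 * wants some_lock, that softirq spins forever: we cannot run
	 * again to release the lock, because we have been interrupted.
	 */
	spin_unlock(&some_lock);
}

/* softirq context, e.g. a block/scsi completion handler */
static void completion_softirq_path(void)
{
	unsigned long flags;

	spin_lock_irqsave(&some_lock, flags);	/* may spin forever, see above */
	/* ... */
	spin_unlock_irqrestore(&some_lock, flags);
}

the fix for such a pattern is to use spin_lock_irqsave() (or at least
spin_lock_bh()) in every process context path as well, so the
completion can never interrupt the lock holder on the same cpu.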
> >>>> =======================
> >>>> BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]
> >>>>
> >>>> Pid: 4880, comm: drbd0_receiver
> >>>> EIP: 0060:[<c0608e1b>] CPU: 1
> >>>> EIP is at _spin_lock_irqsave+0x13/0x27
> >>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
> >>>> EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202
> >>>> ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b
> >>>> CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0
> >>>> [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
> >>>> [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
> >>>> [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
> >>>> [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
> >>>> [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
> >>>> [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
> >>>> [<c04d8a34>] blk_done_softirq+0x4d/0x58
> >>>> [<c042ab5a>] __do_softirq+0x5a/0xbb
> >>>> [<c0407451>] do_softirq+0x52/0x9d
> >>>> [<c042a961>] local_bh_enable+0x74/0x7f
> >
> > the stack trace above is a soft irq,
> > which interrupted the currently active process on this cpu,
> > which happened to be the drbd receiver.
> > but it has nothing directly to do with drbd
> > (other than that it is likely that DRBD submitted the request
> > in the first place).
> >
> > some scsi thing finished, triggered a softirq,
> > and within ./drivers/scsi/scsi_lib.c:scsi_run_queue()
> > it either blocks on spin_lock_irqsave(shost->host_lock, flags),
> > or livelocks within that while loop there,
> > possibly while the other cpu (see above)
> > sits on the same spin_lock within
> > ./drivers/scsi/scsi_lib.c:scsi_device_unbusy(sdev);
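from memory, the loop in question looks roughly like this -- a
simplified paraphrase of the 2.6.18 code, declarations and several
conditions omitted, so read the actual scsi_lib.c of your kernel:

/* simplified paraphrase of drivers/scsi/scsi_lib.c:scsi_run_queue() */
spin_lock_irqsave(shost->host_lock, flags);
while (!list_empty(&shost->starved_list)) {
	/* pick the first starved device and run its queue */
	sdev = list_entry(shost->starved_list.next,
			  struct scsi_device, starved_entry);
	list_del_init(&sdev->starved_entry);

	spin_unlock_irqrestore(shost->host_lock, flags);
	blk_run_queue(sdev->request_queue);
	spin_lock_irqsave(shost->host_lock, flags);

	/*
	 * if running the queue puts the same device straight back onto
	 * starved_list, this loop never terminates: a livelock in
	 * softirq context, taking and retaking host_lock forever,
	 * while the other cpu (trace above) spins on the very same
	 * host_lock in scsi_device_unbusy().
	 */
}
spin_unlock_irqrestore(shost->host_lock, flags);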
> > this looks more like a problem in the generic scsi layer.
> > maybe it even has been fixed upstream;
> > http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9
> > sounds a bit similar (just skip forward to that comment 9,
> > the initial reports/logs/comments are not interesting).
> > maybe the redhat kernel has a similar problem,
> > by "backporting" the upstream commit
> > f0c0a376d0fcd4c5579ecf5e95f88387cba85211,
> > which according to the above mentioned bug broke it...
> >
> > otherwise, stress the box without DRBD.
> > try a different kernel (as you are on 2.6.18-RHEL,
> > start with a "kernel.org" 2.6.18).
> > do a kernel bisect if you find that some kernels show
> > this behaviour and others do not.
> >
> > again, this has nothing to do with DRBD.
> >
> > the stack trace below is DRBD,
> > going into the tcp stack, waiting for something to receive.
> > nothing unusual there.
> >
> >>>> [<c05d1407>] tcp_prequeue_process+0x5a/0x66
> >>>> [<c05d3692>] tcp_recvmsg+0x416/0x9f7
> >>>> [<c05a725e>] sock_common_recvmsg+0x2f/0x45
> >>>> [<c05a5017>] sock_recvmsg+0xe5/0x100
> >>>> [<c0436347>] autoremove_wake_function+0x0/0x2d
> >>>> [<c05a6c97>] kernel_sendmsg+0x27/0x35
> >>>> [<f8d87718>] drbd_send+0x77/0x13f [drbd]
> >>>> [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
> >>>> [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
> >>>> [<f8d78694>] drbd_recv_header+0x10/0x94 [drbd]
> >>>> [<f8d78c4b>] drbdd+0x18/0x12b [drbd]
> >>>> [<f8d7b586>] drbdd_init+0xa0/0x173 [drbd]
> >>>> [<f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]
> >>>> [<f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]
> >>>> [<c0405c3b>] kernel_thread_helper+0x7/0x10
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

__
please don't Cc me, but send to list -- I'm subscribed