[DRBD-user] Kernel Panic

Lars Ellenberg lars.ellenberg at linbit.com
Tue Jan 27 13:38:10 CET 2009

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Mon, Jan 26, 2009 at 05:18:11PM +0000, Igor Neves wrote:
> > thanks for mentioning this.
>  
> > Does anyone know what this can be?
> >
> > I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE; this is a CentOS
> > 5.2 machine.
> >   
> >
> > Not a DRBD problem.
> > Appears to be a problem in the redhat kernel on vmware.
> > please have a look at
> > https://bugzilla.redhat.com/show_bug.cgi?id=463573
> >   
> 
> Thanks for pointing me to this bug, but I think we are speaking of
> different things. That bug mentions a VMware machine as guest; this does
> not happen on the guest, but on the host. The guest is a Windows machine.
> 
> One more point: I had this VMware setup working on another machine
> without problems. Can this be interrupts?


it definitely has something to do with interrupts:
the stack trace you provided hints at a
spinlock deadlock in bottom-half context,
or a livelock within the generic scsi layer.

> Here is the interrupts table:
> 
>            CPU0       CPU1       CPU2       CPU3
>   0:   85216538   85180125   85220346   85160716    IO-APIC-edge  timer
>   1:          8          1          1          0    IO-APIC-edge  i8042
>   4:      32854      32895      32997      32828    IO-APIC-edge  serial
>   7:          1          1          0          0    IO-APIC-edge  parport0
>   8:          0          1          0          0    IO-APIC-edge  rtc
>   9:          0          0          1          0   IO-APIC-level  acpi
>  50:          0          0          0          0         PCI-MSI  ahci
>  58:    1017131    1001386    1008608    1007834   IO-APIC-level  skge
>  66:    2995867    2969551    2982197    2975044   IO-APIC-level  eth3
>  74:    1431195    1496518    1426694    1506294   IO-APIC-level  eth4
>  82:   35769244          0          0          0         PCI-MSI  eth0
>  90:   34243658          0          0          0         PCI-MSI  eth1
> 233:    2817474    2829933    2839738    2827824   IO-APIC-level  arcmsr
> NMI:          0          0          0          0
> LOC:  337373327  337372791  336681613  336681080
> ERR:          0
> MIS:          0
> 
> I wonder if the skge and r8169 drivers are causing problems with
> interrupts that drbd doesn't like, or even arcmsr, the Areca storage
> controller driver.

I doubt that it has anything to do with drbd (other than drbd causing
disk and network load at the same time, and maybe altering timing and
latencies).

I'll try to support that by commenting on the stack traces below.

> >> BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
> >>
> >> Pid: 0, comm:              swapper
> >> EIP: 0060:[<c0608e1b>] CPU: 0
> >> EIP is at _spin_lock_irqsave+0x13/0x27
> >>  EFLAGS: 00000286    Tainted: GF      (2.6.18-92.1.22.el5PAE #1)
> >> EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246
> >> ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b
> >> CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0
> >>  [<f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]
> >>  [<f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]
> >>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
> >>  [<c042ab5a>] __do_softirq+0x5a/0xbb
> >>  [<c0407451>] do_softirq+0x52/0x9d
> >>  [<c04073f6>] do_IRQ+0xa5/0xae
> >>  [<c040592e>] common_interrupt+0x1a/0x20
> >>  [<c0403ccf>] mwait_idle+0x25/0x38
> >>  [<c0403c90>] cpu_idle+0x9f/0xb9
> >>  [<c06eb9ee>] start_kernel+0x379/0x380


see, no mention of drbd in there.
something tries to take a spinlock in softirq context,
but cannot, probably because it is held by someone in process context,
and that someone did not disable interrupts where it should have,
or something re-enabled interrupts early.
or because of the livelock on the other cpu, mentioned below.
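
to illustrate the deadlock variant: the classic pattern looks like the
sketch below (hypothetical driver code for illustration only, not taken
from the RHEL kernel or from scsi_mod).

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(shared_lock);

/* process context: takes the lock WITHOUT disabling interrupts */
static void buggy_process_context(void)
{
        spin_lock(&shared_lock);        /* should be spin_lock_irqsave() */
        /* ... an IRQ fires here, and its softirq runs on this CPU ... */
        spin_unlock(&shared_lock);
}

/* softirq (bottom half) context on the same CPU: spins forever,
 * because the interrupted lock holder above can never run again
 * on this CPU */
static void completion_softirq(struct softirq_action *h)
{
        unsigned long flags;

        spin_lock_irqsave(&shared_lock, flags);        /* deadlocks */
        /* ... */
        spin_unlock_irqrestore(&shared_lock, flags);
}

the soft lockup watchdog then fires after 10 seconds of that spinning,
which is exactly the "BUG: soft lockup - CPU#0 stuck for 10s!" you see.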

> >>  =======================
> >> BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]
> >>
> >> Pid: 4880, comm:       drbd0_receiver
> >> EIP: 0060:[<c0608e1b>] CPU: 1
> >> EIP is at _spin_lock_irqsave+0x13/0x27
> >>  EFLAGS: 00000286    Tainted: GF      (2.6.18-92.1.22.el5PAE #1)
> >> EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202
> >> ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b
> >> CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0
> >>  [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
> >>  [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
> >>  [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
> >>  [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
> >>  [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
> >>  [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
> >>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
> >>  [<c042ab5a>] __do_softirq+0x5a/0xbb
> >>  [<c0407451>] do_softirq+0x52/0x9d
> >>  [<c042a961>] local_bh_enable+0x74/0x7f

the stack trace above is a soft irq,
which interrupted the currently active process on this cpu,
which happened to be the drbd receiver.
but it has nothing directly to do with drbd
(other than that it is likely DRBD submitted the request
in the first place).

some scsi command completed and triggered a softirq,
and within ./drivers/scsi/scsi_lib.c:scsi_run_queue()
it either blocks on spin_lock_irqsave(shost->host_lock, flags),
or livelocks within the while loop there,
possibly while the other cpu (see above)
sits on the same spinlock within
./drivers/scsi/scsi_lib.c:scsi_device_unbusy().
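
for reference, that loop looks roughly like this (paraphrased and
simplified from 2.6.18 drivers/scsi/scsi_lib.c; check your actual
kernel source for the real thing):

#include <linux/blkdev.h>
#include <linux/list.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

static void scsi_run_queue(struct request_queue *q)
{
        struct scsi_device *sdev = q->queuedata;
        struct Scsi_Host *shost = sdev->host;
        unsigned long flags;

        spin_lock_irqsave(shost->host_lock, flags);
        while (!list_empty(&shost->starved_list) &&
               !shost->host_blocked && !shost->host_self_blocked) {
                /* pick the next "starved" device ... */
                sdev = list_entry(shost->starved_list.next,
                                  struct scsi_device, starved_entry);
                list_del_init(&sdev->starved_entry);
                spin_unlock_irqrestore(shost->host_lock, flags);

                /* ... and kick its request queue */
                blk_run_queue(sdev->request_queue);

                spin_lock_irqsave(shost->host_lock, flags);
                /*
                 * if devices keep putting themselves back on the
                 * starved list, this never makes progress: a livelock,
                 * bouncing host_lock against scsi_device_unbusy()
                 * on the other cpu
                 */
        }
        spin_unlock_irqrestore(shost->host_lock, flags);

        blk_run_queue(q);
}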

this looks more like a problem in the generic scsi layer.
maybe it has even been fixed upstream;
http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9
sounds a bit similar (just skip forward to that comment 9,
the initial reports/logs/comments are not interesting).
maybe the redhat kernel has a similar problem,
from "backporting" the upstream commit
 f0c0a376d0fcd4c5579ecf5e95f88387cba85211,
which, according to the above mentioned bug, broke it...

otherwise, stress the box without DRBD.
try a different kernel (as you are on 2.6.18-RHEL,
start with a "kernel.org" 2.6.18).
do a kernel bisect if you find that some kernels
show this behaviour and others do not.

again, this has nothing to do with DRBD.


the stack trace below is DRBD,
going into the tcp stack, waiting for something to receive.
nothing unusual there.
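
for completeness: such a blocking receive from kernel space looks
roughly like this (a simplified, hypothetical helper sketching the
idea, not DRBD's actual drbd_recv() implementation), which is what
produces the sock_recvmsg/tcp_recvmsg frames in the trace:

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

/* receive exactly 'size' bytes from a kernel-space socket */
static int recv_exact(struct socket *sock, void *buf, size_t size)
{
        struct kvec iov = { .iov_base = buf, .iov_len = size };
        struct msghdr msg = { .msg_flags = MSG_WAITALL };

        /* blocks down in tcp_recvmsg() until 'size' bytes arrived,
         * the peer closed the connection, or an error occurred */
        return kernel_recvmsg(sock, &msg, &iov, 1, size, MSG_WAITALL);
}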

> >>  [<c05d1407>] tcp_prequeue_process+0x5a/0x66
> >>  [<c05d3692>] tcp_recvmsg+0x416/0x9f7
> >>  [<c05a725e>] sock_common_recvmsg+0x2f/0x45
> >>  [<c05a5017>] sock_recvmsg+0xe5/0x100
> >>  [<c0436347>] autoremove_wake_function+0x0/0x2d
> >>  [<c05a6c97>] kernel_sendmsg+0x27/0x35
> >>  [<f8d87718>] drbd_send+0x77/0x13f [drbd]
> >>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
> >>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
> >>  [<f8d78694>] drbd_recv_header+0x10/0x94 [drbd]
> >>  [<f8d78c4b>] drbdd+0x18/0x12b [drbd]
> >>  [<f8d7b586>] drbdd_init+0xa0/0x173 [drbd]
> >>  [<f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]
> >>  [<f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]
> >>  [<c0405c3b>] kernel_thread_helper+0x7/0x10


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


