<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hi,<br>
<br>
Take a look at this one; it happened during the night, on the other
node machine, the one that was not running vmware.<br>
<br>
I have also raised the soft lockup threshold to 60 seconds, but that
did not help.<br>
<br>
BUG: soft lockup - CPU#3 stuck for 60s! [drbd0_receiver:8329]<br>
<br>
Pid: 8329, comm: drbd0_receiver<br>
EIP: 0060:[<c0608e1b>] CPU: 3<br>
EIP is at _spin_lock_irqsave+0x13/0x27<br>
EFLAGS: 00000282 Tainted: GF (2.6.18-92.1.22.el5PAE #1)<br>
EAX: f79e0028 EBX: ef110080 ECX: f7c3c000 EDX: 00000202<br>
ESI: f79e0000 EDI: f7c3c094 EBP: 00000001 DS: 007b ES: 007b<br>
CR0: 8005003b CR2: 08cd0000 CR3: 37804d40 CR4: 000006f0<br>
[<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]<br>
[<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]<br>
[<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]<br>
[<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]<br>
[<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]<br>
[<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]<br>
[<c04d8a34>] blk_done_softirq+0x4d/0x58<br>
[<c042ab5a>] __do_softirq+0x5a/0xbb<br>
[<c0407451>] do_softirq+0x52/0x9d<br>
[<c04073f6>] do_IRQ+0xa5/0xae<br>
[<c040592e>] common_interrupt+0x1a/0x20<br>
[<c046bf5a>] kfree+0x68/0x6c<br>
[<c05aa45c>] kfree_skbmem+0x8/0x61<br>
[<c05d3906>] tcp_recvmsg+0x68a/0x9f7<br>
[<c0608e47>] _spin_lock_bh+0x8/0x18<br>
[<c05a725e>] sock_common_recvmsg+0x2f/0x45<br>
[<c05a5017>] sock_recvmsg+0xe5/0x100<br>
[<c0436347>] autoremove_wake_function+0x0/0x2d<br>
[<c05a725e>] sock_common_recvmsg+0x2f/0x45<br>
[<c0436347>] autoremove_wake_function+0x0/0x2d<br>
[<c0455fb0>] mempool_alloc+0x28/0xc9<br>
[<c04750de>] bio_add_page+0x25/0x2b<br>
[<f8b83485>] drbd_recv+0x57/0xd7 [drbd]<br>
[<f8b8536a>] read_in_block+0x7f/0xff [drbd]<br>
[<f8b87dde>] receive_Data+0x135/0x9a2 [drbd]<br>
[<f8b83485>] drbd_recv+0x57/0xd7 [drbd]<br>
[<f8b83c95>] drbdd+0x62/0x12b [drbd]<br>
[<f8b86586>] drbdd_init+0xa0/0x173 [drbd]<br>
[<f8b94e47>] drbd_thread_setup+0xbb/0x150 [drbd]<br>
[<f8b94d8c>] drbd_thread_setup+0x0/0x150 [drbd]<br>
[<c0405c3b>] kernel_thread_helper+0x7/0x10<br>
<br>
<br>
Lars Ellenberg wrote:
<blockquote cite="mid:20090127123810.GD9625@barkeeper1-xen.linbit"
type="cite">
<pre wrap="">On Mon, Jan 26, 2009 at 05:18:11PM +0000, Igor Neves wrote:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">thanks for mentioning this.
</pre>
</blockquote>
<pre wrap="">
</pre>
<blockquote type="cite">
<pre wrap="">Does anyone know what this can be?
I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE; this is a CentOS
5.2 machine.
Not a DRBD problem.
Appears to be a problem in the redhat kernel on vmware.
please have a look at
<a class="moz-txt-link-freetext" href="https://bugzilla.redhat.com/show_bug.cgi?id=463573">https://bugzilla.redhat.com/show_bug.cgi?id=463573</a>
</pre>
</blockquote>
<pre wrap="">Thanks for point me out this bug, but I think we are speaking of
different things. This bugs mention vmware machine as guest, this does
not happen on the guest but on the host. Guest it's one windows machine.
One more point, I had this vmware working on other machine without
problems. Can this be interrupts?
</pre>
</blockquote>
<pre wrap=""><!---->
it definitely has something to do with interrupts,
as the stack trace you provided hints at a
spinlock deadlock in bottom half context,
or a livelock within the generic scsi layer.
</pre>
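<pre wrap="">
For illustration only (made-up code, not from the kernel in question):
this is the shape of a bottom half spinning on a contended lock, which is
why _spin_lock_irqsave shows up as the EIP in the traces above -- if the
lock never becomes free, the CPU spins there with interrupts off until
the soft-lockup watchdog fires.

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(demo_lock);      /* stands in for shost->host_lock */

static void demo_done_softirq(struct softirq_action *h)
{
        unsigned long flags;

        /* spins here, interrupts off, until the current owner lets go */
        spin_lock_irqsave(&demo_lock, flags);
        /* ... complete the request ... */
        spin_unlock_irqrestore(&demo_lock, flags);
}
</pre>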
<blockquote type="cite">
<pre wrap="">Here is the interrupts table:
            CPU0        CPU1        CPU2        CPU3
  0:    85216538    85180125    85220346    85160716   IO-APIC-edge   timer
  1:           8           1           1           0   IO-APIC-edge   i8042
  4:       32854       32895       32997       32828   IO-APIC-edge   serial
  7:           1           1           0           0   IO-APIC-edge   parport0
  8:           0           1           0           0   IO-APIC-edge   rtc
  9:           0           0           1           0   IO-APIC-level  acpi
 50:           0           0           0           0   PCI-MSI        ahci
 58:     1017131     1001386     1008608     1007834   IO-APIC-level  skge
 66:     2995867     2969551     2982197     2975044   IO-APIC-level  eth3
 74:     1431195     1496518     1426694     1506294   IO-APIC-level  eth4
 82:    35769244           0           0           0   PCI-MSI        eth0
 90:    34243658           0           0           0   PCI-MSI        eth1
233:     2817474     2829933     2839738     2827824   IO-APIC-level  arcmsr
NMI:           0           0           0           0
LOC:   337373327   337372791   336681613   336681080
ERR:           0
MIS:           0
I wonder if the skge and r8169 drivers are causing problems with interrupts
and drbd does not like it, or maybe it is arcmsr, the Areca storage
controller driver.
</pre>
</blockquote>
<pre wrap=""><!---->
I doubt that it has anything to do with drbd (other than drbd causing
disk and network load at the same time, and maybe altering timing and
latencies).
I'll try to back that up by commenting on the stack traces below.
</pre>
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
Pid: 0, comm: swapper
EIP: 0060:[<c0608e1b>] CPU: 0
EIP is at _spin_lock_irqsave+0x13/0x27
EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246
ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b
CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0
[<f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]
[<f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]
[<c04d8a34>] blk_done_softirq+0x4d/0x58
[<c042ab5a>] __do_softirq+0x5a/0xbb
[<c0407451>] do_softirq+0x52/0x9d
[<c04073f6>] do_IRQ+0xa5/0xae
[<c040592e>] common_interrupt+0x1a/0x20
[<c0403ccf>] mwait_idle+0x25/0x38
[<c0403c90>] cpu_idle+0x9f/0xb9
[<c06eb9ee>] start_kernel+0x379/0x380
</pre>
</blockquote>
</blockquote>
</blockquote>
<pre wrap=""><!---->
see, no mention of drbd in there.
something tries to get a spinlock in softirq context,
but cannot, probably because it is held by someone in process context,
and that someone did not disable interrupts where it should have,
or something re-enabled interrupts early,
or because of the livelock on the other cpu mentioned below.
</pre>
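<pre wrap="">
For illustration only (a made-up driver, not the code in question), this
is the classic way to produce exactly that deadlock: process context takes
a lock that a softirq also needs, but with plain spin_lock(), so interrupts
and softirqs stay enabled on that CPU.

static DEFINE_SPINLOCK(shared_lock);

static void buggy_process_context(void)
{
        spin_lock(&shared_lock);        /* WRONG: irqs/softirqs still on */
        /* an interrupt arrives here, the softirq runs on this same CPU
         * and calls spin_lock_irqsave(&shared_lock, ...) -- it spins
         * forever, because the owner below it can never run again      */
        spin_unlock(&shared_lock);
}

static void correct_process_context(void)
{
        unsigned long flags;

        spin_lock_irqsave(&shared_lock, flags);   /* irqs off: safe */
        /* ... critical section ... */
        spin_unlock_irqrestore(&shared_lock, flags);
}
</pre>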
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<pre wrap=""> =======================
BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]
Pid: 4880, comm: drbd0_receiver
EIP: 0060:[<c0608e1b>] CPU: 1
EIP is at _spin_lock_irqsave+0x13/0x27
EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202
ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b
CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0
[<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
[<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
[<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
[<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
[<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
[<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
[<c04d8a34>] blk_done_softirq+0x4d/0x58
[<c042ab5a>] __do_softirq+0x5a/0xbb
[<c0407451>] do_softirq+0x52/0x9d
[<c042a961>] local_bh_enable+0x74/0x7f
</pre>
</blockquote>
</blockquote>
</blockquote>
<pre wrap=""><!---->
the stack trace above is a soft irq,
which interrupted the currently active process on this cpu,
which happened to be the drbd receiver.
but it has nothing directly to do with drbd
(other than that it is likely that DRBD submitted the request
in the first place).
some scsi thing finished, triggered a softirq,
and within ./drivers/scsi/scsi_lib.c:scsi_run_queue()
it either blocks on spin_lock_irqsave(shost->host_lock, flags),
or livelocks within that while loop there,
possibly while the other cpu (see above)
sits on the same spin_lock within
./drivers/scsi/scsi_lib.c:scsi_device_unbusy(sdev);
this looks more like a problem in the generic scsi layer.
maybe it even has been fixed upstream,
<a class="moz-txt-link-freetext" href="http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9">http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9</a>
sounds a bit similar (just skip forward to that comment 9;
the initial reports/logs/comments are not interesting).
maybe the redhat kernel has a similar problem,
by "backporting" the upstream commit
f0c0a376d0fcd4c5579ecf5e95f88387cba85211,
which according to the above mentioned bug broke it...
otherwise, stress the box without DRBD.
try a different kernel (as you are with 2.6.18-RHEL,
start with a "kernel.org" 2.6.18).
do a kernel bisect if you find that
some kernels show this behaviour and others do not.
again, this has nothing to do with DRBD.
stack trace below is DRBD,
going into the tcp stack, waiting for something to receive.
nothing unusual there.
</pre>
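<pre wrap="">
Before the DRBD trace below: for illustration, a much simplified paraphrase
of the scsi_run_queue() / scsi_device_unbusy() contention just described --
NOT the actual RHEL scsi_lib.c code, just the shape of it.  If the loop on
one CPU never runs out of work, or the lock is never dropped at all, both
CPUs end up spinning in _spin_lock_irqsave and the soft-lockup watchdog fires.

struct demo_host {
        spinlock_t       host_lock;
        struct list_head starved_list;
};

/* CPU#1, completion softirq: scsi_run_queue()-like loop */
static void run_queue_like(struct demo_host *shost)
{
        unsigned long flags;

        spin_lock_irqsave(&shost->host_lock, flags);
        while (!list_empty(&shost->starved_list)) {
                /* pick a starved device, drop the lock, kick its queue,
                 * then re-take the lock; if entries keep being added,
                 * this loop never terminates -> livelock                */
                spin_unlock_irqrestore(&shost->host_lock, flags);
                /* ... run the device's request queue ... */
                spin_lock_irqsave(&shost->host_lock, flags);
        }
        spin_unlock_irqrestore(&shost->host_lock, flags);
}

/* CPU#0, another completion softirq: scsi_device_unbusy()-like path,
 * waiting on the very same host_lock                                   */
static void device_unbusy_like(struct demo_host *shost)
{
        unsigned long flags;

        spin_lock_irqsave(&shost->host_lock, flags);
        /* ... decrement busy counters ... */
        spin_unlock_irqrestore(&shost->host_lock, flags);
}
</pre>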
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<pre wrap=""> [<c05d1407>] tcp_prequeue_process+0x5a/0x66
[<c05d3692>] tcp_recvmsg+0x416/0x9f7
[<c05a725e>] sock_common_recvmsg+0x2f/0x45
[<c05a5017>] sock_recvmsg+0xe5/0x100
[<c0436347>] autoremove_wake_function+0x0/0x2d
[<c05a6c97>] kernel_sendmsg+0x27/0x35
[<f8d87718>] drbd_send+0x77/0x13f [drbd]
[<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
[<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
[<f8d78694>] drbd_recv_header+0x10/0x94 [drbd]
[<f8d78c4b>] drbdd+0x18/0x12b [drbd]
[<f8d7b586>] drbdd_init+0xa0/0x173 [drbd]
[<f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]
[<f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]
[<c0405c3b>] kernel_thread_helper+0x7/0x10
</pre>
</blockquote>
</blockquote>
</blockquote>
<pre wrap=""><!---->
</pre>
</blockquote>
</body>
</html>