<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Hi,<br>

<br>

Take a look at this one. It happened last night on the other node,

which was not running VMware.<br>

<br>

I have also raised the soft lockup threshold to 60 seconds, but that did

not help.<br>

<br>

BUG: soft lockup - CPU#3 stuck for 60s! [drbd0_receiver:8329]<br>

<br>

Pid: 8329, comm:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; drbd0_receiver<br>

EIP: 0060:[&lt;c0608e1b&gt;] CPU: 3<br>

EIP is at _spin_lock_irqsave+0x13/0x27<br>

&nbsp;EFLAGS: 00000282&nbsp;&nbsp;&nbsp; Tainted: GF&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (2.6.18-92.1.22.el5PAE #1)<br>

EAX: f79e0028 EBX: ef110080 ECX: f7c3c000 EDX: 00000202<br>

ESI: f79e0000 EDI: f7c3c094 EBP: 00000001 DS: 007b ES: 007b<br>

CR0: 8005003b CR2: 08cd0000 CR3: 37804d40 CR4: 000006f0<br>

&nbsp;[&lt;f887edef&gt;] scsi_run_queue+0xcd/0x189 [scsi_mod]<br>

&nbsp;[&lt;f887f46e&gt;] scsi_next_command+0x25/0x2f [scsi_mod]<br>

&nbsp;[&lt;f887f583&gt;] scsi_end_request+0x9f/0xa9 [scsi_mod]<br>

&nbsp;[&lt;f887f6cd&gt;] scsi_io_completion+0x140/0x2ea [scsi_mod]<br>

&nbsp;[&lt;f885a3d2&gt;] sd_rw_intr+0x1f1/0x21b [sd_mod]<br>

&nbsp;[&lt;f887b3b9&gt;] scsi_finish_command+0x73/0x77 [scsi_mod]<br>

&nbsp;[&lt;c04d8a34&gt;] blk_done_softirq+0x4d/0x58<br>

&nbsp;[&lt;c042ab5a&gt;] __do_softirq+0x5a/0xbb<br>

&nbsp;[&lt;c0407451&gt;] do_softirq+0x52/0x9d<br>

&nbsp;[&lt;c04073f6&gt;] do_IRQ+0xa5/0xae<br>

&nbsp;[&lt;c040592e&gt;] common_interrupt+0x1a/0x20<br>

&nbsp;[&lt;c046bf5a&gt;] kfree+0x68/0x6c<br>

&nbsp;[&lt;c05aa45c&gt;] kfree_skbmem+0x8/0x61<br>

&nbsp;[&lt;c05d3906&gt;] tcp_recvmsg+0x68a/0x9f7<br>

&nbsp;[&lt;c0608e47&gt;] _spin_lock_bh+0x8/0x18<br>

&nbsp;[&lt;c05a725e&gt;] sock_common_recvmsg+0x2f/0x45<br>

&nbsp;[&lt;c05a5017&gt;] sock_recvmsg+0xe5/0x100<br>

&nbsp;[&lt;c0436347&gt;] autoremove_wake_function+0x0/0x2d<br>

&nbsp;[&lt;c05a725e&gt;] sock_common_recvmsg+0x2f/0x45<br>

&nbsp;[&lt;c0436347&gt;] autoremove_wake_function+0x0/0x2d<br>

&nbsp;[&lt;c0455fb0&gt;] mempool_alloc+0x28/0xc9<br>

&nbsp;[&lt;c04750de&gt;] bio_add_page+0x25/0x2b<br>

&nbsp;[&lt;f8b83485&gt;] drbd_recv+0x57/0xd7 [drbd]<br>

&nbsp;[&lt;f8b8536a&gt;] read_in_block+0x7f/0xff [drbd]<br>

&nbsp;[&lt;f8b87dde&gt;] receive_Data+0x135/0x9a2 [drbd]<br>

&nbsp;[&lt;f8b83485&gt;] drbd_recv+0x57/0xd7 [drbd]<br>

&nbsp;[&lt;f8b83c95&gt;] drbdd+0x62/0x12b [drbd]<br>

&nbsp;[&lt;f8b86586&gt;] drbdd_init+0xa0/0x173 [drbd]<br>

&nbsp;[&lt;f8b94e47&gt;] drbd_thread_setup+0xbb/0x150 [drbd]<br>

&nbsp;[&lt;f8b94d8c&gt;] drbd_thread_setup+0x0/0x150 [drbd]<br>

&nbsp;[&lt;c0405c3b&gt;] kernel_thread_helper+0x7/0x10<br>

<br>
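For what it's worth, here is a rough sketch for splitting a trace like the one above into the frames that ran in interrupt/softirq context and the frames of the process that was interrupted. It assumes that common_interrupt and local_bh_enable mark the boundary on this kernel; that set, and the helper names parse_frames and split_contexts, are my own choices for illustration, not anything from the kernel or DRBD.<br>
<br>
<pre wrap="">
```python
import re

# Matches the part of a frame line after the bracketed address, e.g.
# "] scsi_run_queue+0xcd/0x189 [scsi_mod]" (the module tag is optional).
FRAME = re.compile(r"\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?")

# Assumption: these symbols mark where interrupt/softirq handling was entered
# on this i386 kernel; other kernels or architectures use other names.
ENTRY_POINTS = {"common_interrupt", "local_bh_enable"}


def parse_frames(trace):
    """Return (function, module) pairs in the order the trace lists them."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.search(line)
        if match:
            frames.append((match.group(1), match.group(2) or "kernel"))
    return frames


def split_contexts(frames):
    """Split frames into (interrupt side, interrupted side) at the first entry point."""
    for index, (function, _module) in enumerate(frames):
        if function in ENTRY_POINTS:
            return frames[: index + 1], frames[index + 1 :]
    return [], frames


# A few frames copied from the trace above.
sample = """\
 [&lt;f887edef&gt;] scsi_run_queue+0xcd/0x189 [scsi_mod]
 [&lt;c04d8a34&gt;] blk_done_softirq+0x4d/0x58
 [&lt;c040592e&gt;] common_interrupt+0x1a/0x20
 [&lt;c05d3906&gt;] tcp_recvmsg+0x68a/0x9f7
 [&lt;f8b83485&gt;] drbd_recv+0x57/0xd7 [drbd]
"""
interrupt_side, process_side = split_contexts(parse_frames(sample))
print("interrupt side modules:", sorted({m for _f, m in interrupt_side}))
print("process side modules:  ", sorted({m for _f, m in process_side}))
```
</pre>
If the drbd module only shows up on the process side, the spinning frame belongs to whatever the interrupt was doing, not to the drbd receiver itself.<br>
<br>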

<br>

Lars Ellenberg wrote:

<blockquote cite="mid:20090127123810.GD9625@barkeeper1-xen.linbit"

 type="cite">

  <pre wrap="">On Mon, Jan 26, 2009 at 05:18:11PM +0000, Igor Neves wrote:

  </pre>

  <blockquote type="cite">

    <blockquote type="cite">

      <pre wrap="">thanks for mentioning this.

      </pre>

    </blockquote>

    <pre wrap=""> 

    </pre>

    <blockquote type="cite">

      <pre wrap="">Does anyone know what this can be?

I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE, this is a Centos

5.2 machine.

Not a DRBD problem.

Appears to be a problem in the redhat kernel on vmware.

please have a look at

<a class="moz-txt-link-freetext" href="https://bugzilla.redhat.com/show_bug.cgi?id=463573">https://bugzilla.redhat.com/show_bug.cgi?id=463573</a>

      </pre>

    </blockquote>

    <pre wrap="">Thanks for pointing me to this bug, but I think we are talking about

different things. That bug is about vmware guests; here the lockup does

not happen on the guest but on the host. The guest is a Windows machine.

One more point: I had this vmware setup working on another machine without

problems. Can this be interrupts?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

it definitely has something to do with interrupts,

as the stack trace you provided hints at a

spinlock deadlock in bottom half context,

or a livelock within the generic scsi layer.

  </pre>

  <blockquote type="cite">

    <pre wrap="">Here is the interrupts table:

           CPU0       CPU1       CPU2       CPU3

  0:   85216538   85180125   85220346   85160716    IO-APIC-edge  timer

  1:          8          1          1          0    IO-APIC-edge  i8042

  4:      32854      32895      32997      32828    IO-APIC-edge  serial

  7:          1          1          0          0    IO-APIC-edge  parport0

  8:          0          1          0          0    IO-APIC-edge  rtc

  9:          0          0          1          0   IO-APIC-level  acpi

 50:          0          0          0          0         PCI-MSI  ahci

 58:    1017131    1001386    1008608    1007834   IO-APIC-level  skge

 66:    2995867    2969551    2982197    2975044   IO-APIC-level  eth3

 74:    1431195    1496518    1426694    1506294   IO-APIC-level  eth4

 82:   35769244          0          0          0         PCI-MSI  eth0

 90:   34243658          0          0          0         PCI-MSI  eth1

233:    2817474    2829933    2839738    2827824   IO-APIC-level  arcmsr

NMI:          0          0          0          0

LOC:  337373327  337372791  336681613  336681080

ERR:          0

MIS:          0

I wonder if the skge and r8169 drivers are causing problems with interrupts

that drbd does not cope with, or whether it is arcmsr, the driver for the

Areca storage controller.

    </pre>

  </blockquote>
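In the table above, eth0 and eth1 (both PCI-MSI) were serviced only by CPU0, while the other busy devices are spread over all four CPUs. Whether that matters for the lockup is not established here. The sketch below only makes such rows easy to spot; the function names and the minimum threshold are illustrative assumptions.<br>
<br>
<pre wrap="">
```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq: (per-CPU counts, description)}."""
    lines = [line for line in text.splitlines() if line.strip()]
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:cpu_count]
        # Rows such as ERR and MIS carry a single total, not per-CPU counts.
        if len(counts) != cpu_count or not all(c.isdigit() for c in counts):
            continue
        table[label.strip()] = ([int(c) for c in counts], " ".join(fields[cpu_count:]))
    return table


def single_cpu_irqs(table, minimum=1000):
    """List busy IRQs that were serviced by exactly one CPU."""
    found = []
    for irq, (counts, description) in table.items():
        busy_cpus = [cpu for cpu, count in enumerate(counts) if count]
        if len(busy_cpus) == 1 and sum(counts) >= minimum:
            found.append((irq, description, busy_cpus[0], sum(counts)))
    return found


# A few rows copied from the table above.
sample = """\
           CPU0       CPU1       CPU2       CPU3
 58:    1017131    1001386    1008608    1007834   IO-APIC-level  skge
 82:   35769244          0          0          0         PCI-MSI  eth0
 90:   34243658          0          0          0         PCI-MSI  eth1
233:    2817474    2829933    2839738    2827824   IO-APIC-level  arcmsr
ERR:          0
"""
for irq, description, cpu, total in single_cpu_irqs(parse_interrupts(sample)):
    print(f"IRQ {irq} ({description}): {total} interrupts, all on CPU{cpu}")
```
</pre>
On a live machine the same functions can be fed the contents of /proc/interrupts instead of the pasted sample.<br>
<br>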

  <pre wrap=""><!---->

I doubt that it has anything to do with drbd (other than drbd causing

disk and network load at the same time, and maybe altering timing and

latencies).

I'll try to show why by commenting on the stack traces below.

  </pre>

  <blockquote type="cite">

    <blockquote type="cite">

      <blockquote type="cite">

        <pre wrap="">BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]

Pid: 0, comm:              swapper

EIP: 0060:[&lt;c0608e1b&gt;] CPU: 0

EIP is at _spin_lock_irqsave+0x13/0x27

 EFLAGS: 00000286    Tainted: GF      (2.6.18-92.1.22.el5PAE #1)

EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246

ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b

CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0

 [&lt;f887f8d1&gt;] scsi_device_unbusy+0xf/0x69 [scsi_mod]

 [&lt;f887b356&gt;] scsi_finish_command+0x10/0x77 [scsi_mod]

 [&lt;c04d8a34&gt;] blk_done_softirq+0x4d/0x58

 [&lt;c042ab5a&gt;] __do_softirq+0x5a/0xbb

 [&lt;c0407451&gt;] do_softirq+0x52/0x9d

 [&lt;c04073f6&gt;] do_IRQ+0xa5/0xae

 [&lt;c040592e&gt;] common_interrupt+0x1a/0x20

 [&lt;c0403ccf&gt;] mwait_idle+0x25/0x38

 [&lt;c0403c90&gt;] cpu_idle+0x9f/0xb9

 [&lt;c06eb9ee&gt;] start_kernel+0x379/0x380

        </pre>

      </blockquote>

    </blockquote>

  </blockquote>

  <pre wrap=""><!---->

see, no mention of drbd in there.

something tries to get a spinlock in softirq context,

but cannot, probably because it is held by someone in process context,

and that someone did not disable interrupts where it should have done,

or something re-enabled interrupts early.

or because of the livelock on the other cpu mentioned below.

  </pre>

  <blockquote type="cite">

    <blockquote type="cite">

      <blockquote type="cite">

        <pre wrap=""> =======================

BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]

Pid: 4880, comm:       drbd0_receiver

EIP: 0060:[&lt;c0608e1b&gt;] CPU: 1

EIP is at _spin_lock_irqsave+0x13/0x27

 EFLAGS: 00000286    Tainted: GF      (2.6.18-92.1.22.el5PAE #1)

EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202

ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b

CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0

 [&lt;f887edef&gt;] scsi_run_queue+0xcd/0x189 [scsi_mod]

 [&lt;f887f46e&gt;] scsi_next_command+0x25/0x2f [scsi_mod]

 [&lt;f887f583&gt;] scsi_end_request+0x9f/0xa9 [scsi_mod]

 [&lt;f887f6cd&gt;] scsi_io_completion+0x140/0x2ea [scsi_mod]

 [&lt;f885a3d2&gt;] sd_rw_intr+0x1f1/0x21b [sd_mod]

 [&lt;f887b3b9&gt;] scsi_finish_command+0x73/0x77 [scsi_mod]

 [&lt;c04d8a34&gt;] blk_done_softirq+0x4d/0x58

 [&lt;c042ab5a&gt;] __do_softirq+0x5a/0xbb

 [&lt;c0407451&gt;] do_softirq+0x52/0x9d

 [&lt;c042a961&gt;] local_bh_enable+0x74/0x7f

        </pre>

      </blockquote>

    </blockquote>

  </blockquote>

  <pre wrap=""><!---->

stack trace above is a soft irq,

which interrupted the current active process on this cpu,

which happened to be the drbd receiver.

but it has nothing directly to do with drbd

(other than that it is likely that DRBD submitted the request

in the first place).

some scsi thing finished, triggered a softirq,

and within ./drivers/scsi/scsi_lib.c:scsi_run_queue()

it either blocks on spin_lock_irqsave(shost-&gt;host_lock, flags),

or livelocks within that while loop there,

possibly while the other cpu (see above)

sits on the same spin_lock within

./drivers/scsi/scsi_lib.c:scsi_device_unbusy(sdev);

this looks more like a problem in the generic scsi layer.

maybe it even has been fixed upstream,

<a class="moz-txt-link-freetext" href="http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9">http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9</a>

sounds a bit similar (just skip forward to that comment 9,

the initial reports/logs/comments are not interesting).

maybe the redhat kernel has a similar problem,

by "backporting" the upstream commit

 f0c0a376d0fcd4c5579ecf5e95f88387cba85211,

which according to above mentioned bug broke it...

otherwise, stress the box without DRBD.

try a different kernel (as you are with 2.6.18-RHEL,

start with a "kernel.org" 2.6.18).

do a kernel bisect, if you find that

some kernels show this behaviour, and others do not.

again, this has nothing to do with DRBD.

stack trace below is DRBD,

going into the tcp stack, waiting for something to receive.

nothing unusual there.

  </pre>

  <blockquote type="cite">

    <blockquote type="cite">

      <blockquote type="cite">

        <pre wrap=""> [&lt;c05d1407&gt;] tcp_prequeue_process+0x5a/0x66

 [&lt;c05d3692&gt;] tcp_recvmsg+0x416/0x9f7

 [&lt;c05a725e&gt;] sock_common_recvmsg+0x2f/0x45

 [&lt;c05a5017&gt;] sock_recvmsg+0xe5/0x100

 [&lt;c0436347&gt;] autoremove_wake_function+0x0/0x2d

 [&lt;c05a6c97&gt;] kernel_sendmsg+0x27/0x35

 [&lt;f8d87718&gt;] drbd_send+0x77/0x13f [drbd]

 [&lt;f8d78485&gt;] drbd_recv+0x57/0xd7 [drbd]

 [&lt;f8d78485&gt;] drbd_recv+0x57/0xd7 [drbd]

 [&lt;f8d78694&gt;] drbd_recv_header+0x10/0x94 [drbd]

 [&lt;f8d78c4b&gt;] drbdd+0x18/0x12b [drbd]

 [&lt;f8d7b586&gt;] drbdd_init+0xa0/0x173 [drbd]

 [&lt;f8d89e47&gt;] drbd_thread_setup+0xbb/0x150 [drbd]

 [&lt;f8d89d8c&gt;] drbd_thread_setup+0x0/0x150 [drbd]

 [&lt;c0405c3b&gt;] kernel_thread_helper+0x7/0x10

        </pre>

      </blockquote>

    </blockquote>

  </blockquote>

  <pre wrap=""><!---->

  </pre>

</blockquote>

</body>

</html>