<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi Igor,&nbsp;<div><br></div><div>I was asking about the RAID hardware because I had similar (or at least similar-looking) problems with DRBD 8.0.13, Kernel 2.6.18-6 (Debian Etch) and CPU soft lockups under high disk loads:</div><div>(<a href="http://www.gossamer-threads.com/lists/drbd/users/16399?search_string=flush;#16399">http://www.gossamer-threads.com/lists/drbd/users/16399?search_string=flush;#16399</a>)</div><div><br></div><div>Unfortunately, I did not have the time to do exhaustive testing (i.e. isolate the problem, reproduce without DRBD, change specific software versions, &nbsp;etc.) as I was under some pressure to go into production.&nbsp;What I did was try to eliminate as many potential causes of the problems as possible, which meant upgrading to a newer kernel version and backporting the RAID controller driver, as well as upgrading to DRBD 8.0.14.&nbsp;</div><div><br></div><div>I did some fairly extensive stress testing after that, and the system has been running without any problems since then. My strongest bet concerning the cause of the problem is still the RAID controller I was using (LSI 1078), as the driver versions in kernel 2.6.18 are known to cause problems under high load. But of course, I can't be sure. And as I'm now reading that you have similar problems even though you are using a different RAID card, I'm wondering if there could be any problems with DRBD and that specific kernel version? 
Is it possible for you to upgrade the kernel and do some testing?</div><div><br></div><div>Best regards,&nbsp;</div><div><br></div><div>&nbsp;&nbsp;Thomas</div><div><br></div><div><br><div><div>On 27.01.2009 at 17:47, Igor Neves wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"> <div bgcolor="#ffffff" text="#000000"> Hi,<br> <br> Thomas Reinhold wrote: <blockquote cite="mid:0CFCF0EC-1F32-494E-A37C-E08ECFD4260B@thomasreinhold.de" type="cite">Hi,&nbsp;  <div><br>  </div>  <div>What kind of RAID hardware are you using?</div> </blockquote> <br> I'm using an Areca controller, but I'm pretty sure it's not the Areca, because we have several more like it in production.<br> <br> <blockquote cite="mid:0CFCF0EC-1F32-494E-A37C-E08ECFD4260B@thomasreinhold.de" type="cite">  <div><br>  </div>  <div>Regards,&nbsp;</div>  <div><br>  </div>  <div>&nbsp;&nbsp;Thomas</div>  <div><br>  </div>  <div>  <div>  <div>On 26.01.2009 at 18:18, Igor Neves wrote:</div>  <br class="Apple-interchange-newline">  <blockquote type="cite">    <div bgcolor="#ffffff" text="#000000"> Hi,<br>    <br> Thanks for your help.<br>    <br> Lars Ellenberg wrote:    <blockquote cite="mid:20090126154910.GE9911@barkeeper1-xen.linbit" type="cite">      <pre wrap="">On Mon, Jan 26, 2009 at 03:12:02PM +0000, Igor Neves wrote:

  </pre>      <blockquote type="cite">        <pre wrap="">Hi,

I'm having serious problems with one machine running drbd, and I'm getting these

kernel panics

</pre>      </blockquote>      <pre wrap="">these are NOT kernel panics...

even though they may look very similar.

  </pre>    </blockquote>    <br> Yes, you are right, but they kill my machine like a kernel panic! :)<br>    <br>    <blockquote cite="mid:20090126154910.GE9911@barkeeper1-xen.linbit" type="cite">      <pre wrap="">  </pre>      <blockquote type="cite">        <pre wrap="">on the machine I run the vmware server.

</pre>      </blockquote>      <pre wrap="">thanks for mentioning this.</pre>    </blockquote> &nbsp; <br>    <blockquote cite="mid:20090126154910.GE9911@barkeeper1-xen.linbit" type="cite">      <pre wrap="">Does anyone know what this can be?

I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE, this is a Centos

5.2 machine.

</pre>    </blockquote>    <blockquote cite="mid:20090126154910.GE9911@barkeeper1-xen.linbit" type="cite">      <pre wrap="">Not a DRBD problem.

Appears to be a problem in the redhat kernel on vmware.

please have a look at

<a moz-do-not-send="true" class="moz-txt-link-freetext" href="https://bugzilla.redhat.com/show_bug.cgi?id=463573">https://bugzilla.redhat.com/show_bug.cgi?id=463573</a>

  </pre>    </blockquote>    <br> Thanks for pointing me to this bug, but I think we are talking about different things. That bug concerns a vmware machine running as a guest; my problem does not happen on the guest but on the host. The guest is a Windows machine.<br>    <br> One more point: I had this vmware setup working on another machine without problems. Could this be related to interrupts?<br>    <br> Here is the interrupt table:<br>    <br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; CPU0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; CPU1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; CPU2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; CPU3<br> &nbsp; 0:&nbsp;&nbsp; 85216538&nbsp;&nbsp; 85180125&nbsp;&nbsp; 85220346&nbsp;&nbsp; 85160716&nbsp;&nbsp;&nbsp; IO-APIC-edge&nbsp; timer<br> &nbsp; 1:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; IO-APIC-edge&nbsp; i8042<br> &nbsp; 4:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 32854&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 32895&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 32997&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 32828&nbsp;&nbsp;&nbsp; IO-APIC-edge&nbsp; serial<br> &nbsp; 7:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; IO-APIC-edge&nbsp; parport0<br> &nbsp; 8:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; IO-APIC-edge&nbsp; rtc<br> &nbsp; 9:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp; IO-APIC-level&nbsp; acpi<br> &nbsp;50:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; PCI-MSI&nbsp; ahci<br> &nbsp;58:&nbsp;&nbsp;&nbsp; 1017131&nbsp;&nbsp;&nbsp; 1001386&nbsp;&nbsp;&nbsp; 1008608&nbsp;&nbsp;&nbsp; 1007834&nbsp;&nbsp; IO-APIC-level&nbsp; skge<br> &nbsp;66:&nbsp;&nbsp;&nbsp; 2995867&nbsp;&nbsp;&nbsp; 2969551&nbsp;&nbsp;&nbsp; 2982197&nbsp;&nbsp;&nbsp; 2975044&nbsp;&nbsp; IO-APIC-level&nbsp; eth3<br> &nbsp;74:&nbsp;&nbsp;&nbsp; 1431195&nbsp;&nbsp;&nbsp; 1496518&nbsp;&nbsp;&nbsp; 1426694&nbsp;&nbsp;&nbsp; 1506294&nbsp;&nbsp; IO-APIC-level&nbsp; eth4<br> &nbsp;82:&nbsp;&nbsp; 35769244&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; PCI-MSI&nbsp; eth0<br> &nbsp;90:&nbsp;&nbsp; 34243658&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; PCI-MSI&nbsp; eth1<br> 233:&nbsp;&nbsp;&nbsp; 2817474&nbsp;&nbsp;&nbsp; 2829933&nbsp;&nbsp;&nbsp; 2839738&nbsp;&nbsp;&nbsp; 2827824&nbsp;&nbsp; IO-APIC-level&nbsp; arcmsr<br> NMI:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0<br> LOC:&nbsp; 337373327&nbsp; 337372791&nbsp; 336681613&nbsp; 336681080<br> ERR:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0<br> MIS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0<br>    
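<br> A rough way to check how evenly the interrupts in the table above are spread is to parse it programmatically. The following is only a minimal Python sketch, not something taken from the kernel or from any DRBD tool: the function names parse_interrupts and single_cpu_irqs, the 90% share and the 1000-interrupt minimum are all assumptions chosen for illustration. It flags IRQs that were handled almost entirely by one CPU:<br>
<pre wrap="">
```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        irq, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields are the per-CPU counters; ERR/MIS have only one.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[irq.strip()] = (counts, description)
    return table


def single_cpu_irqs(table, share=0.9, min_total=1000):
    """Return (irq, description, cpu) for IRQs where one CPU took at least
    `share` of all interrupts. Both thresholds are illustrative assumptions."""
    flagged = []
    for irq, (counts, description) in table.items():
        total = sum(counts)
        if len(counts) > 1 and total >= min_total and max(counts) >= share * total:
            flagged.append((irq, description, counts.index(max(counts))))
    return flagged
```
</pre>
Fed the table above as plain text (for example the contents of /proc/interrupts), it would flag only eth0 and eth1, both PCI-MSI and both handled entirely by CPU0. That alone does not prove anything, since MSI interrupts often land on a single CPU, but it shows where the interrupt load is concentrated.<br>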
<br> I wonder whether the skge and r8169 drivers are causing interrupt problems that drbd doesn't cope with, or whether it is arcmsr, the Areca controller's storage driver.<br>    <br>    <blockquote cite="mid:20090126154910.GE9911@barkeeper1-xen.linbit" type="cite">      <pre wrap="">  </pre>      <blockquote type="cite">        <pre wrap="">Thanks

 =======================

BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]

Pid: 0, comm:              swapper

EIP: 0060:[&lt;c0608e1b>] CPU: 0

EIP is at _spin_lock_irqsave+0x13/0x27

 EFLAGS: 00000286    Tainted: GF      (2.6.18-92.1.22.el5PAE #1)

EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246

ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b

CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0

 [&lt;f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]

 [&lt;f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]

 [&lt;c04d8a34>] blk_done_softirq+0x4d/0x58

 [&lt;c042ab5a>] __do_softirq+0x5a/0xbb

 [&lt;c0407451>] do_softirq+0x52/0x9d

 [&lt;c04073f6>] do_IRQ+0xa5/0xae

 [&lt;c040592e>] common_interrupt+0x1a/0x20

 [&lt;c0403ccf>] mwait_idle+0x25/0x38

 [&lt;c0403c90>] cpu_idle+0x9f/0xb9

 [&lt;c06eb9ee>] start_kernel+0x379/0x380

 =======================

BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]

Pid: 4880, comm:       drbd0_receiver

EIP: 0060:[&lt;c0608e1b>] CPU: 1

EIP is at _spin_lock_irqsave+0x13/0x27

 EFLAGS: 00000286    Tainted: GF      (2.6.18-92.1.22.el5PAE #1)

EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202

ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b

CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0

 [&lt;f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]

 [&lt;f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]

 [&lt;f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]

 [&lt;f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]

 [&lt;f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]

 [&lt;f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]

 [&lt;c04d8a34>] blk_done_softirq+0x4d/0x58

 [&lt;c042ab5a>] __do_softirq+0x5a/0xbb

 [&lt;c0407451>] do_softirq+0x52/0x9d

 [&lt;c042a961>] local_bh_enable+0x74/0x7f

 [&lt;c05d1407>] tcp_prequeue_process+0x5a/0x66

 [&lt;c05d3692>] tcp_recvmsg+0x416/0x9f7

 [&lt;c05a725e>] sock_common_recvmsg+0x2f/0x45

 [&lt;c05a5017>] sock_recvmsg+0xe5/0x100

 [&lt;c0436347>] autoremove_wake_function+0x0/0x2d

 [&lt;c05a6c97>] kernel_sendmsg+0x27/0x35

 [&lt;f8d87718>] drbd_send+0x77/0x13f [drbd]

 [&lt;f8d78485>] drbd_recv+0x57/0xd7 [drbd]

 [&lt;f8d78485>] drbd_recv+0x57/0xd7 [drbd]

 [&lt;f8d78694>] drbd_recv_header+0x10/0x94 [drbd]

 [&lt;f8d78c4b>] drbdd+0x18/0x12b [drbd]

 [&lt;f8d7b586>] drbdd_init+0xa0/0x173 [drbd]

 [&lt;f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]

 [&lt;f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]

 [&lt;c0405c3b>] kernel_thread_helper+0x7/0x10

    </pre>      </blockquote>      <pre wrap=""><!---->  </pre>    </blockquote>    </div> _______________________________________________<br> drbd-user mailing list<br>    <a moz-do-not-send="true" href="mailto:drbd-user@lists.linbit.com">drbd-user@lists.linbit.com</a><br>    <a moz-do-not-send="true" href="http://lists.linbit.com/mailman/listinfo/drbd-user">http://lists.linbit.com/mailman/listinfo/drbd-user</a><br>  </blockquote>  </div>  <br>  </div>  <pre wrap=""><hr size="4" width="90%">


  </pre> </blockquote> </div> </blockquote></div><br></div></body></html>