On Mon, Jan 26, 2009 at 05:18:11PM +0000, Igor Neves wrote:
> > thanks for mentioning this.
> > > Does anyone know what this can be?
> >
> > I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE, this is a Centos
> > 5.2 machine.
> >
> >
> > Not a DRBD problem.
> > Appears to be a problem in the redhat kernel on vmware.
> > please have a look at
> > https://bugzilla.redhat.com/show_bug.cgi?id=463573
> >
>
> Thanks for point me out this bug, but I think we are speaking of
> different things. This bugs mention vmware machine as guest, this does
> not happen on the guest but on the host. Guest it's one windows machine.
>
> One more point, I had this vmware working on other machine without
> problems. Can this be interrupts?

it definitely has something to do with interrupts,
as the stack trace you provided hints at a
spinlock deadlock in bottom half context,
or a livelock within the generic scsi layer.

> Here is the interrupts table:
>
>            CPU0       CPU1       CPU2       CPU3
>   0:   85216538   85180125   85220346   85160716    IO-APIC-edge  timer
>   1:          8          1          1          0    IO-APIC-edge  i8042
>   4:      32854      32895      32997      32828    IO-APIC-edge  serial
>   7:          1          1          0          0    IO-APIC-edge  parport0
>   8:          0          1          0          0    IO-APIC-edge  rtc
>   9:          0          0          1          0   IO-APIC-level  acpi
>  50:          0          0          0          0         PCI-MSI  ahci
>  58:    1017131    1001386    1008608    1007834   IO-APIC-level  skge
>  66:    2995867    2969551    2982197    2975044   IO-APIC-level  eth3
>  74:    1431195    1496518    1426694    1506294   IO-APIC-level  eth4
>  82:   35769244          0          0          0         PCI-MSI  eth0
>  90:   34243658          0          0          0         PCI-MSI  eth1
> 233:    2817474    2829933    2839738    2827824   IO-APIC-level  arcmsr
> NMI:          0          0          0          0
> LOC:  337373327  337372791  336681613  336681080
> ERR:          0
> MIS:          0
>
> I wonder if skge and r8169 drivers are making problems with interrupts
> and drbd don't like it or even arcmsr that it's the areca controller
> storage driver.

I doubt that it has anything to do with drbd
(other than drbd causing disk and network load at the same time,
and maybe altering timing and latencies).

I'll try to explain that by commenting on the stack trace below.

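As an aside for readers comparing such tables between machines: the following is a minimal sketch, not part of the original exchange, that parses text in the /proc/interrupts layout quoted above and lists the busy IRQs whose interrupts were all delivered to a single CPU. The function names and the `min_total` threshold are made up for illustration. It only summarises the table; a device being served by one CPU is not by itself a sign of a problem.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():          # rows like ERR/MIS have fewer counters
                break
            counts.append(int(field))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


def single_cpu_irqs(table, min_total=1000):
    """Labels of busy IRQs whose whole count sits on one CPU (hypothetical threshold)."""
    found = []
    for label, (counts, _desc) in table.items():
        total = sum(counts)
        if len(counts) > 1 and total >= min_total and max(counts) == total:
            found.append(label)
    return found


# A few rows copied from the table quoted above.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 58:    1017131    1001386    1008608    1007834   IO-APIC-level  skge
 82:   35769244          0          0          0         PCI-MSI  eth0
 90:   34243658          0          0          0         PCI-MSI  eth1
233:    2817474    2829933    2839738    2827824   IO-APIC-level  arcmsr
ERR:          0
"""

if __name__ == "__main__":
    print(single_cpu_irqs(parse_interrupts(SAMPLE)))
```

On the sample rows this reports the two PCI-MSI network interrupts (82 and 90), which matches what can be read off the table by eye.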
> >> BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
> >>
> >> Pid: 0, comm: swapper
> >> EIP: 0060:[<c0608e1b>] CPU: 0
> >> EIP is at _spin_lock_irqsave+0x13/0x27
> >> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
> >> EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246
> >> ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b
> >> CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0
> >> [<f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]
> >> [<f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]
> >> [<c04d8a34>] blk_done_softirq+0x4d/0x58
> >> [<c042ab5a>] __do_softirq+0x5a/0xbb
> >> [<c0407451>] do_softirq+0x52/0x9d
> >> [<c04073f6>] do_IRQ+0xa5/0xae
> >> [<c040592e>] common_interrupt+0x1a/0x20
> >> [<c0403ccf>] mwait_idle+0x25/0x38
> >> [<c0403c90>] cpu_idle+0x9f/0xb9
> >> [<c06eb9ee>] start_kernel+0x379/0x380

see, no mention of drbd in there.
something tries to get a spinlock in softirq context,
but cannot, probably because it is held by someone in process context,
and that someone did not disable interrupts where it should have done,
or something re-enabled interrupts early.
or because of the livelock on the other cpu mentioned below.

> >> =======================
> >> BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]
> >>
> >> Pid: 4880, comm: drbd0_receiver
> >> EIP: 0060:[<c0608e1b>] CPU: 1
> >> EIP is at _spin_lock_irqsave+0x13/0x27
> >> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
> >> EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202
> >> ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b
> >> CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0
> >> [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
> >> [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
> >> [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
> >> [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
> >> [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
> >> [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
> >> [<c04d8a34>] blk_done_softirq+0x4d/0x58
> >> [<c042ab5a>] __do_softirq+0x5a/0xbb
> >> [<c0407451>] do_softirq+0x52/0x9d
> >> [<c042a961>] local_bh_enable+0x74/0x7f

stack trace above is a soft irq,
which interrupted the currently active process on this cpu,
which happened to be the drbd receiver.
but it has nothing directly to do with drbd
(other than that it is likely that DRBD submitted the request
in the first place).

some scsi thing finished, triggered a softirq,
and within ./drivers/scsi/scsi_lib.c:scsi_run_queue()
it either blocks on spin_lock_irqsave(shost->host_lock, flags),
or livelocks within that while loop there,
possibly while the other cpu (see above)
sits on the same spin_lock within
./drivers/scsi/scsi_lib.c:scsi_device_unbusy(sdev);

this looks more like a problem in the generic scsi layer.
maybe it even has been fixed upstream,
http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9
sounds a bit similar (just skip forward to that comment 9,
the initial reports/logs/comments are not interesting).
maybe the redhat kernel has a similar problem,
by "backporting" the upstream commit
f0c0a376d0fcd4c5579ecf5e95f88387cba85211,
which according to the above mentioned bug broke it...

otherwise, stress the box without DRBD.
try a different kernel (as you are with 2.6.18-RHEL,
start with a "kernel.org" 2.6.18).
do a kernel bisect, if you find that some kernels
show this behaviour and others do not.

again, this has nothing to do with DRBD.

stack trace below is DRBD,
going into the tcp stack, waiting for something to receive.
nothing unusual there.

> >> [<c05d1407>] tcp_prequeue_process+0x5a/0x66
> >> [<c05d3692>] tcp_recvmsg+0x416/0x9f7
> >> [<c05a725e>] sock_common_recvmsg+0x2f/0x45
> >> [<c05a5017>] sock_recvmsg+0xe5/0x100
> >> [<c0436347>] autoremove_wake_function+0x0/0x2d
> >> [<c05a6c97>] kernel_sendmsg+0x27/0x35
> >> [<f8d87718>] drbd_send+0x77/0x13f [drbd]
> >> [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
> >> [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
> >> [<f8d78694>] drbd_recv_header+0x10/0x94 [drbd]
> >> [<f8d78c4b>] drbdd+0x18/0x12b [drbd]
> >> [<f8d7b586>] drbdd_init+0xa0/0x173 [drbd]
> >> [<f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]
> >> [<f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]
> >> [<c0405c3b>] kernel_thread_helper+0x7/0x10

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
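
For reference, the "which modules appear in the stuck part of the trace" check done by eye in this message can be sketched as a small script. This is only an illustration under assumptions: the regular expression targets the `[<addr>] symbol+0xoff/0xlen [module]` frame layout seen in the traces quoted here, and the helper names are hypothetical. It says nothing about the cause of a lockup; it just lists the module tags on the frames that ran before a chosen symbol such as `local_bh_enable`.

```python
import re

# One stack frame: "[<addr>] symbol+0xoff/0xlen" with an optional "[module]" tag.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def frames(trace):
    """Yield (symbol, module) per frame; module is None for core kernel code."""
    for match in FRAME.finditer(trace):
        yield match.group(1), match.group(2)


def modules_before(trace, stop_symbol=None):
    """Modules on the frames listed before stop_symbol (all frames if None)."""
    seen = []
    for symbol, module in frames(trace):
        if symbol == stop_symbol:
            break
        if module and module not in seen:
            seen.append(module)
    return seen


# A few frames copied from the CPU#1 trace quoted in this message.
SAMPLE = """\
> >> [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
> >> [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
> >> [<c04d8a34>] blk_done_softirq+0x4d/0x58
> >> [<c042a961>] local_bh_enable+0x74/0x7f
> >> [<c05d3692>] tcp_recvmsg+0x416/0x9f7
> >> [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
"""

if __name__ == "__main__":
    # Frames above local_bh_enable are the softirq part of the trace.
    print(modules_before(SAMPLE, "local_bh_enable"))
    print(modules_before(SAMPLE))
```

With the sample frames, the softirq part lists only scsi_mod and sd_mod; drbd shows up only once the frames of the interrupted process are included.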