Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

Thanks so much for your help.

Lars Ellenberg wrote:
> On Mon, Jan 26, 2009 at 05:18:11PM +0000, Igor Neves wrote:
>>> thanks for mentioning this.
>>
>>> Does anyone know what this can be?
>>>
>>> I'm using drbd 8.0.13 on kernel 2.6.18-92.1.22.el5PAE, this is a CentOS
>>> 5.2 machine.
>>>
>>> Not a DRBD problem.
>>> Appears to be a problem in the redhat kernel on vmware.
>>> please have a look at
>>> https://bugzilla.redhat.com/show_bug.cgi?id=463573
>>
>> Thanks for pointing me to this bug, but I think we are speaking of
>> different things. That bug mentions a vmware machine as the guest; this
>> does not happen on the guest but on the host. The guest is a Windows
>> machine.
>>
>> One more point: I had this vmware setup working on another machine
>> without problems. Can this be interrupts?
>
> it definitely has something to do with interrupts,
> as the stack trace you provided hints at a
> spinlock deadlock in bottom half context,
> or a livelock within the generic scsi layer.
>
>> Here is the interrupts table:
>>
>>             CPU0        CPU1        CPU2        CPU3
>>    0:   85216538    85180125    85220346    85160716   IO-APIC-edge   timer
>>    1:          8           1           1           0   IO-APIC-edge   i8042
>>    4:      32854       32895       32997       32828   IO-APIC-edge   serial
>>    7:          1           1           0           0   IO-APIC-edge   parport0
>>    8:          0           1           0           0   IO-APIC-edge   rtc
>>    9:          0           0           1           0   IO-APIC-level  acpi
>>   50:          0           0           0           0   PCI-MSI        ahci
>>   58:    1017131     1001386     1008608     1007834   IO-APIC-level  skge
>>   66:    2995867     2969551     2982197     2975044   IO-APIC-level  eth3
>>   74:    1431195     1496518     1426694     1506294   IO-APIC-level  eth4
>>   82:   35769244           0           0           0   PCI-MSI        eth0
>>   90:   34243658           0           0           0   PCI-MSI        eth1
>>  233:    2817474     2829933     2839738     2827824   IO-APIC-level  arcmsr
>>  NMI:          0           0           0           0
>>  LOC:  337373327   337372791   336681613   336681080
>>  ERR:          0
>>  MIS:          0
>>
>> I wonder if the skge and r8169 drivers are causing problems with
>> interrupts that drbd doesn't like, or even arcmsr, which is the Areca
>> storage controller driver.
>
> I doubt that it has anything to do with drbd (other than drbd causing
> disk and network load at the same time, and maybe altering timing and
> latencies).
>
> I'll try to reason that by commenting the stack trace below.
>
>>>> BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
>>>>
>>>> Pid: 0, comm: swapper
>>>> EIP: 0060:[<c0608e1b>] CPU: 0
>>>> EIP is at _spin_lock_irqsave+0x13/0x27
>>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
>>>> EAX: f79c4028 EBX: f79c4000 ECX: c072afdc EDX: 00000246
>>>> ESI: f7c4ac00 EDI: f7c4ac00 EBP: 00000000 DS: 007b ES: 007b
>>>> CR0: 8005003b CR2: 9fb19000 CR3: 00724000 CR4: 000006f0
>>>>  [<f887f8d1>] scsi_device_unbusy+0xf/0x69 [scsi_mod]
>>>>  [<f887b356>] scsi_finish_command+0x10/0x77 [scsi_mod]
>>>>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
>>>>  [<c042ab5a>] __do_softirq+0x5a/0xbb
>>>>  [<c0407451>] do_softirq+0x52/0x9d
>>>>  [<c04073f6>] do_IRQ+0xa5/0xae
>>>>  [<c040592e>] common_interrupt+0x1a/0x20
>>>>  [<c0403ccf>] mwait_idle+0x25/0x38
>>>>  [<c0403c90>] cpu_idle+0x9f/0xb9
>>>>  [<c06eb9ee>] start_kernel+0x379/0x380
>
> see, no mention of drbd in there.
> something tries to get a spinlock in softirq context,
> but cannot, probably because it is held by someone in process context,
> and that someone did not disable interrupts where it should have done,
> or something re-enabled interrupts early.
> or because of the below mentioned livelock on the other cpu.
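[Editor's note: to picture the failure mode Lars describes, here is a minimal
sketch of the broken locking pattern -- a lock shared between process context
and softirq context that is taken without disabling interrupts on the process
side. The names are generic illustrations, not the actual scsi code:]

#include <linux/spinlock.h>

/* a lock shared between process context and a softirq handler */
static DEFINE_SPINLOCK(shared_lock);

static void process_context_path(void)
{
        spin_lock(&shared_lock);        /* BUG: should be spin_lock_irqsave()
                                         * (or at least spin_lock_bh()) */
        /* ... if an interrupt arrives here and its softirq runs on this
         * same CPU, the softirq below spins forever ... */
        spin_unlock(&shared_lock);
}

static void softirq_path(void)          /* invoked via __do_softirq() */
{
        unsigned long flags;

        /* never acquired: the holder is the very task this softirq
         * interrupted, and it cannot run again until we return */
        spin_lock_irqsave(&shared_lock, flags);
        spin_unlock_irqrestore(&shared_lock, flags);
}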
>>>> =======================
>>>> BUG: soft lockup - CPU#1 stuck for 10s! [drbd0_receiver:4880]
>>>>
>>>> Pid: 4880, comm: drbd0_receiver
>>>> EIP: 0060:[<c0608e1b>] CPU: 1
>>>> EIP is at _spin_lock_irqsave+0x13/0x27
>>>> EFLAGS: 00000286 Tainted: GF (2.6.18-92.1.22.el5PAE #1)
>>>> EAX: f79c4028 EBX: ca5ff6c0 ECX: f7c4ac00 EDX: 00000202
>>>> ESI: f79c4000 EDI: f7c4ac94 EBP: 00000001 DS: 007b ES: 007b
>>>> CR0: 8005003b CR2: 7ff72000 CR3: 00724000 CR4: 000006f0
>>>>  [<f887edef>] scsi_run_queue+0xcd/0x189 [scsi_mod]
>>>>  [<f887f46e>] scsi_next_command+0x25/0x2f [scsi_mod]
>>>>  [<f887f583>] scsi_end_request+0x9f/0xa9 [scsi_mod]
>>>>  [<f887f6cd>] scsi_io_completion+0x140/0x2ea [scsi_mod]
>>>>  [<f885a3d2>] sd_rw_intr+0x1f1/0x21b [sd_mod]
>>>>  [<f887b3b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
>>>>  [<c04d8a34>] blk_done_softirq+0x4d/0x58
>>>>  [<c042ab5a>] __do_softirq+0x5a/0xbb
>>>>  [<c0407451>] do_softirq+0x52/0x9d
>>>>  [<c042a961>] local_bh_enable+0x74/0x7f
>
> the stack trace above is a soft irq,
> which interrupted the currently active process on this cpu,
> which happened to be the drbd receiver.
> but it has nothing directly to do with drbd
> (other than it is likely that DRBD submitted the request
> in the first place).
>
> some scsi thing finished, triggered a softirq,
> and within ./drivers/scsi/scsi_lib.c:scsi_run_queue()
> it either blocks on spin_lock_irqsave(shost->host_lock, flags),
> or livelocks within that while loop there,
> possibly while the other cpu (see above)
> sits on the same spin_lock within
> ./drivers/scsi/scsi_lib.c:scsi_device_unbusy(sdev);
>
> this looks more like a problem in the generic scsi layer.
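[Editor's note: for a concrete picture, here is the rough shape of the two
contending paths, paraphrased from Lars's description -- an illustrative
sketch, not the literal 2.6.18 drivers/scsi/scsi_lib.c source:]

#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/* CPU#0 (first trace): blocks here while CPU#1 holds the lock. */
static void scsi_device_unbusy_sketch(struct scsi_device *sdev)
{
        struct Scsi_Host *shost = sdev->host;
        unsigned long flags;

        spin_lock_irqsave(shost->host_lock, flags);     /* <-- CPU#0 spins */
        shost->host_busy--;
        spin_unlock_irqrestore(shost->host_lock, flags);
}

/* CPU#1 (second trace): loops over starved devices under the same lock;
 * if devices keep re-entering the starved list, the loop never
 * terminates (livelock) and CPU#0 never gets the lock. */
static void scsi_run_queue_sketch(struct Scsi_Host *shost)
{
        unsigned long flags;

        spin_lock_irqsave(shost->host_lock, flags);
        while (!list_empty(&shost->starved_list)) {
                /* pick a starved device, drop the lock,
                 * kick its queue, then re-take the lock */
                spin_unlock_irqrestore(shost->host_lock, flags);
                /* ... run the device's request queue ... */
                spin_lock_irqsave(shost->host_lock, flags);
        }
        spin_unlock_irqrestore(shost->host_lock, flags);
}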
> maybe it has even been fixed upstream;
> http://bugzilla.kernel.org/show_bug.cgi?id=11898#c9
> sounds a bit similar (just skip forward to that comment 9,
> the initial reports/logs/comments are not interesting).
> maybe the redhat kernel has a similar problem,
> by "backporting" the upstream commit
> f0c0a376d0fcd4c5579ecf5e95f88387cba85211,
> which according to the above mentioned bug broke it...
>
> otherwise, stress the box without DRBD.

I have done this, without any problems. I don't know exactly when this
happens, but normally it's when we generate some I/O, copying files to the
disk. But it only happens when doing it over drbd, and after some time
(1-2 days...). If I do it on the same controller and over the same disks,
without drbd in the way, I have no problems at all. And we have these
controllers working fine everywhere, as we do drbd. So I must insist this
has something to do with the NICs drbd uses to replicate while it's
writing. Can this be true? Or am I terribly mistaken?

> try a different kernel (as you are with 2.6.18-RHEL,
> start with a "kernel.org" 2.6.18).
> do a kernel bisect, if you find
> some kernels show this behaviour, and others do not.
>
> again, this has nothing to do with DRBD.
>
> the stack trace below is DRBD,
> going into the tcp stack, waiting for something to receive.
> nothing unusual there.

Does this stack trace stop when it is waiting to receive from the network
(NIC)?

>>>>  [<c05d1407>] tcp_prequeue_process+0x5a/0x66
>>>>  [<c05d3692>] tcp_recvmsg+0x416/0x9f7
>>>>  [<c05a725e>] sock_common_recvmsg+0x2f/0x45
>>>>  [<c05a5017>] sock_recvmsg+0xe5/0x100
>>>>  [<c0436347>] autoremove_wake_function+0x0/0x2d
>>>>  [<c05a6c97>] kernel_sendmsg+0x27/0x35
>>>>  [<f8d87718>] drbd_send+0x77/0x13f [drbd]
>>>>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
>>>>  [<f8d78485>] drbd_recv+0x57/0xd7 [drbd]
>>>>  [<f8d78694>] drbd_recv_header+0x10/0x94 [drbd]
>>>>  [<f8d78c4b>] drbdd+0x18/0x12b [drbd]
>>>>  [<f8d7b586>] drbdd_init+0xa0/0x173 [drbd]
>>>>  [<f8d89e47>] drbd_thread_setup+0xbb/0x150 [drbd]
>>>>  [<f8d89d8c>] drbd_thread_setup+0x0/0x150 [drbd]
>>>>  [<c0405c3b>] kernel_thread_helper+0x7/0x10
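[Editor's note: the trace above is consistent with a thread simply sleeping
in tcp_recvmsg() until data arrives, which is the normal idle state of a
blocking in-kernel receive. A rough generic sketch of such a receive --
not DRBD's actual drbd_recv() -- follows:]

#include <linux/net.h>
#include <linux/socket.h>

/* Receive exactly 'size' bytes from an in-kernel socket, blocking
 * until they arrive. kernel_recvmsg() ends up in tcp_recvmsg(), where
 * the calling thread sleeps until the data is available, the peer
 * closes the connection, or a signal is delivered -- the state shown
 * in the stack trace above. */
static int recv_exact(struct socket *sock, void *buf, size_t size)
{
        struct kvec iov = { .iov_base = buf, .iov_len = size };
        struct msghdr msg = { .msg_flags = MSG_WAITALL | MSG_NOSIGNAL };

        return kernel_recvmsg(sock, &msg, &iov, 1, size, msg.msg_flags);
}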