We have a  web application that delivers training to thousands of corporate students running on top of SQL Server 2005. Recently, we started seeing that a single specific query in the application went from 1 second to about 30 seconds in terms of execution time. The application started throwing timeouts in that area.
Our first thought was that we may have incorrect indexes, so we reviewed the tables and indexes. However, similar queries elsewhere in the application also run quickly. Reviewing the indexes showed us that they were configured as expected.
We were able to narrow it down to a single query, not a stored procedure. Running this query in SQL Studio also runs quickly. We tried running the application in a different server environment. So a different web server with the same query, parameters and database. The query still ran slow.
The query is a fairly large one related to determining a student's current list of training. It includes joins and left joins on a dozen tables and subqueries. A few of the tables are fairly large (hundreds of thousands of rows) and some of the other tables are small lookup tables. The query uses a grouping clause and a few where conditions. A few of the tables are quite active and the contents change often but the volume of added rows doesn't seem extreme.
These symptoms led us to consider the execution plan. First off, as soon as we reset the execution plan cache with the SQL command 'DBCC FREEPROCCACHE', the problem went away. Unfortunately, the problem started to reoccur within a few days.
The problem has continued to plague us for awhile now. It's usually the same query, but we did appear to see the problem occur in another single query recently. It happens enough to be a nuisance.
We're having a heck of a time trying to fix it since we can't reproduce it in any other environment other than production. I have downloaded the High Availability guide from Red Gate and I read up more on execution plans. I hope to run the profiler on the live server, but I'm a bit concerned about impact. I would like to ask - what is the best way to figure out what is triggering this problem? Has anyone else seen this same issue?