I have a very large web forum application (about 20 million posts since 2001) running on a SQL Server 2012 database. The data files are about 40 GB in size.
I added indexes on the appropriate columns, but this query (which reports the thread count, post count, and date range of posts in each forum) still takes about 40 minutes to run:
SELECT
    T2.ForumId,
    Forums.Title,
    T2.ForumThreads,
    T2.ForumPosts,
    T2.ForumStart,
    T2.ForumStop
FROM
    Forums
    INNER JOIN (
        SELECT
            MIN(ThreadStart) AS ForumStart,
            MAX(ThreadStop) AS ForumStop,
            COUNT(*) AS ForumThreads,
            SUM(ThreadPosts) AS ForumPosts,
            Threads.ForumId
        FROM
            Threads
            INNER JOIN (
                SELECT
                    MIN(Posts.DateTime) AS ThreadStart,
                    MAX(Posts.DateTime) AS ThreadStop,
                    COUNT(*) AS ThreadPosts,
                    Posts.ThreadId
                FROM
                    Posts
                GROUP BY
                    Posts.ThreadId
            ) AS P2 ON Threads.ThreadId = P2.ThreadId
        GROUP BY
            Threads.ForumId
    ) AS T2 ON T2.ForumId = Forums.ForumId
How could I speed it up?
UPDATE:
This is the estimated execution plan, read from right to left:
[Path 1]
Clustered Index Scan (Clustered) [Posts].[PK_Posts], Cost: 98%
Hash Match (Partial Aggregate), Cost: 2%
Parallelism (Repartition Streams), Cost: 0%
Hash Match (Aggregate), Cost: 0%
Compute Scalar, Cost: 0%
Bitmap (Bitmap Create), Cost: 0%
[Path 2]
Index Scan (NonClustered) [Threads].[IX_ForumId], Cost: 0%
Parallelism (Repartition Streams), Cost: 0%
[Path 1 and 2 converge into Path 3]
Hash Match (Inner Join), Cost: 0%
Hash Match (Partial Aggregate), Cost: 0%
Parallelism (Repartition Streams), Cost: 0%
Sort, Cost: 0%
Stream Aggregate (Aggregate), Cost: 0%
Compute Scalar, Cost: 0%
[Path 4]
Clustered Index Seek (Clustered) [Forums].[PK_Forums], Cost: 0%
[Path 3 and 4 converge into Path 5]
Nested Loops (Inner Join), Cost: 0%
Parallelism (Gather Streams), Cost: 0%
SELECT, Cost: 0%
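Reading the plan, almost all of the estimated cost (98%) is the clustered index scan on Posts, which feeds the innermost GROUP BY. That subquery only touches ThreadId and DateTime, so a narrow nonclustered index on those two columns may let SQL Server read a much smaller structure than the full clustered index. The following is only a sketch: the index name is made up, and the dbo schema is an assumption.

```sql
-- Hypothetical index name; dbo schema assumed.
-- Keyed on ThreadId first so rows arrive grouped by thread,
-- with DateTime second so MIN/MAX/COUNT can be computed from the index alone.
CREATE NONCLUSTERED INDEX IX_Posts_ThreadId_DateTime
    ON dbo.Posts (ThreadId, [DateTime]);
```

The query would still have to read every row of this index, since it aggregates over all posts, but each row is far smaller than a full post row.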
I added some more indexes to the database and it sped things up considerably. Execution time is now about 20 seconds (!!). I'll admit that a lot of the added indexes were guesswork (or just added at random).
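Since many of those indexes were guesses, one way to see which of them are actually being used is to look at the index usage counters SQL Server keeps in sys.dm_db_index_usage_stats. A rough sketch, assuming the three tables live in the dbo schema and the query is run in the forum database:

```sql
-- Sketch: read/write activity per index on the tables used by the query.
-- The counters are reset when the SQL Server instance restarts, and an index
-- that has not been touched since then shows NULLs (no row in the DMV).
SELECT
    OBJECT_NAME(i.object_id) AS TableName,
    i.name                   AS IndexName,
    i.type_desc              AS IndexType,
    s.user_seeks,
    s.user_scans,
    s.user_lookups,
    s.user_updates
FROM
    sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS s
        ON  s.object_id   = i.object_id
        AND s.index_id    = i.index_id
        AND s.database_id = DB_ID()
WHERE
    i.object_id IN (
        OBJECT_ID('dbo.Posts'),    -- dbo schema is an assumption
        OBJECT_ID('dbo.Threads'),
        OBJECT_ID('dbo.Forums')
    )
ORDER BY
    TableName,
    IndexName;
```

An index with a high user_updates count but no seeks, scans, or lookups is probably one of the random ones: it adds write cost without helping any reads, so it is a candidate for dropping.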