I am using SQL Server to store data about ticket validation. A single ticket can be validated at multiple places. I need to group records by “entry” and “exit” place and calculate statistics about the time that elapsed between two validations.
Here is the table (simplified for clarity):
CREATE TABLE TestDuration
(VALIDATION_TIMESTAMP datetime,
ID_TICKET bigint,
ID_PLACE bigint)
And data:
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 19:24:05.700', 1, 1)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 20:08:04.250', 2, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:13.040', 3, 3)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:20.990', 1, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:29.290', 2, 4)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:25:37.040', 1, 4)
Here is the aggregation query:
SELECT VisitDurationCalcTable.ID_PLACE AS ID_PLACE_IN,
       VisitDurationCalcTable.ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT,
       COUNT(visitduration) AS NUMBER_OF_VISITS,
       AVG(visitduration) AS AVERAGE_VISIT_DURATION
FROM (
    SELECT EntryData.VALIDATION_TIMESTAMP, EntryData.ID_TICKET, EntryData.ID_PLACE,
           (
               SELECT TOP 1 ID_PLACE FROM TestDuration
               WHERE ID_TICKET = EntryData.ID_TICKET
                 AND VALIDATION_TIMESTAMP > EntryData.VALIDATION_TIMESTAMP
               ORDER BY VALIDATION_TIMESTAMP ASC
           ) AS ID_NEXT_VISIT_PLACE,
           DATEDIFF(n, EntryData.VALIDATION_TIMESTAMP,
               (
                   SELECT TOP 1 VALIDATION_TIMESTAMP FROM TestDuration
                   WHERE ID_TICKET = EntryData.ID_TICKET
                     AND VALIDATION_TIMESTAMP > EntryData.VALIDATION_TIMESTAMP
                   ORDER BY VALIDATION_TIMESTAMP ASC
               )
           ) AS visitduration
    FROM TestDuration EntryData
) AS VisitDurationCalcTable
WHERE VisitDurationCalcTable.ID_NEXT_VISIT_PLACE IS NOT NULL
GROUP BY VisitDurationCalcTable.ID_PLACE, VisitDurationCalcTable.ID_NEXT_VISIT_PLACE
The query works, but I’ve hit a performance problem pretty fast. With 40K rows in the table, the query takes about 3 minutes to execute. I’m no SQL guru, so I cannot really see how to transform the query to make it faster. It’s not a critical report and is run only about once per month, but it nevertheless makes my app look bad. I have a feeling I’m missing something simple here.
TLDR Version
You are clearly missing an index that would help this query. Adding the missing index will likely cause an order of magnitude improvement on its own.
If you are on SQL Server 2012, rewriting the query using LEAD would also do it (though that too would benefit from the missing index).
If you are still on 2005/2008, you can make some improvements to the existing query, but the effect will be relatively minor compared to the index change.
Longer Version
For this to take 3 minutes I assume you have no useful indexes at all, and that the biggest win would be simply to add an index. (For a report run once a month, copying the data from the three columns into an appropriately indexed #temp table might suffice if you don’t want to create a permanent index.)
You say that you simplified the table for clarity and that it has 40K rows. Assuming the following test data
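A setup along these lines would give a comparable test table to experiment with. This is only a sketch: the extra `ID` and `OTHER_COLS` columns, the value ranges, and the use of `sys.all_objects` as a row source are all assumptions, chosen to give a table that is wider than the three columns in the question and has a clustered primary key. Timings on such data will not necessarily match the figures quoted in this answer.

```sql
-- Hypothetical test table (assumed shape, not the real schema).
-- Drop or rename any existing TestDuration table first.
CREATE TABLE TestDuration
(
    ID                   int IDENTITY(1,1) PRIMARY KEY,  -- clustered by default
    VALIDATION_TIMESTAMP datetime,
    ID_TICKET            bigint,
    ID_PLACE             bigint,
    OTHER_COLS           char(100) NULL                  -- padding to widen the rows
);

-- Generate 40,000 rows, one minute apart, with pseudo-random tickets and places.
INSERT INTO TestDuration (VALIDATION_TIMESTAMP, ID_TICKET, ID_PLACE)
SELECT TOP (40000)
       DATEADD(MINUTE, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), '20120701'),
       ABS(CHECKSUM(NEWID()) % 1000) + 1,   -- assumed: about 1,000 distinct tickets
       ABS(CHECKSUM(NEWID()) % 10) + 1      -- assumed: 10 distinct places
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;
```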
Your original query takes 51 seconds on my machine at MAXDOP 1 and the following IO stats
For each of the 40,000 rows in the table it is doing two sorts of all matching ID_TICKET rows in order to identify the next one in order of VALIDATION_TIMESTAMP.
Simply adding an index as below brings the elapsed time down to 406 ms, an improvement of more than 100 times (the subsequent queries in this answer assume this index is now in place).
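An index matching that access pattern would be keyed on the ticket and then the timestamp, so each correlated lookup becomes a seek to the next row for the same ticket. A minimal sketch follows; the index name is made up, and including `ID_PLACE` is an assumption made so that the index covers the query without lookups into the base table.

```sql
-- Hypothetical index name. Key order matters: equality column first
-- (ID_TICKET), then the range/order column (VALIDATION_TIMESTAMP).
CREATE NONCLUSTERED INDEX IX_TestDuration_Ticket_Timestamp
    ON TestDuration (ID_TICKET, VALIDATION_TIMESTAMP)
    INCLUDE (ID_PLACE);   -- assumed: carried along so the index is covering
```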
The plan now looks as follows with the 80,000 sorts and spool operations replaced with index seeks.
It is still doing 2 seeks for every row, however. Rewriting with CROSS APPLY allows these to be combined.
This gives me an elapsed time of 269 ms.
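A CROSS APPLY form of the query could look something like the following. It is a sketch rather than a definitive rewrite: the aliases `t` and `NextVisit` are made up, and it fetches the next place and the next timestamp from a single `TOP (1)` lookup instead of two separate subqueries.

```sql
SELECT EntryData.ID_PLACE AS ID_PLACE_IN,
       NextVisit.ID_PLACE AS ID_PLACE_OUT,
       COUNT(*)           AS NUMBER_OF_VISITS,
       AVG(DATEDIFF(n, EntryData.VALIDATION_TIMESTAMP,
                       NextVisit.VALIDATION_TIMESTAMP)) AS AVERAGE_VISIT_DURATION
FROM TestDuration AS EntryData
CROSS APPLY (
    -- One seek per outer row returns both columns of the next validation.
    SELECT TOP (1) t.ID_PLACE, t.VALIDATION_TIMESTAMP
    FROM TestDuration AS t
    WHERE t.ID_TICKET = EntryData.ID_TICKET
      AND t.VALIDATION_TIMESTAMP > EntryData.VALIDATION_TIMESTAMP
    ORDER BY t.VALIDATION_TIMESTAMP ASC
) AS NextVisit
WHERE NextVisit.ID_PLACE IS NOT NULL   -- mirrors the filter in the original query
GROUP BY EntryData.ID_PLACE, NextVisit.ID_PLACE;
```

CROSS APPLY discards outer rows for which the inner query returns nothing, so the last validation of each ticket drops out without a separate check.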
Whilst the number of reads is still quite high, the seeks are all reading pages that have just been read by the scan, so they are all pages in cache. The number of reads can be reduced by using a table variable.
However, for me at least, that slightly increased the elapsed time to 301 ms (43 ms for the insert + 258 ms for the select), but this could still be a good option in lieu of creating a permanent index.
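One way to do that is to copy the three columns into a table variable whose primary key plays the role of the index, then run the CROSS APPLY query against it. The sketch below rests on a few assumptions: the `RN` column is a made-up uniquifier so that duplicate (ticket, timestamp) pairs do not violate the key, and rows with a NULL ticket or timestamp are skipped because they can never pair with a following row anyway. The whole script has to run as a single batch, since a table variable does not outlive its batch.

```sql
DECLARE @T TABLE
(
    ID_TICKET            bigint   NOT NULL,
    VALIDATION_TIMESTAMP datetime NOT NULL,
    RN                   int      NOT NULL,  -- hypothetical uniquifier
    ID_PLACE             bigint   NULL,
    PRIMARY KEY (ID_TICKET, VALIDATION_TIMESTAMP, RN)
);

INSERT INTO @T (ID_TICKET, VALIDATION_TIMESTAMP, RN, ID_PLACE)
SELECT ID_TICKET,
       VALIDATION_TIMESTAMP,
       ROW_NUMBER() OVER (PARTITION BY ID_TICKET, VALIDATION_TIMESTAMP
                          ORDER BY ID_PLACE),
       ID_PLACE
FROM TestDuration
WHERE ID_TICKET IS NOT NULL
  AND VALIDATION_TIMESTAMP IS NOT NULL;

SELECT EntryData.ID_PLACE AS ID_PLACE_IN,
       NextVisit.ID_PLACE AS ID_PLACE_OUT,
       COUNT(*)           AS NUMBER_OF_VISITS,
       AVG(DATEDIFF(n, EntryData.VALIDATION_TIMESTAMP,
                       NextVisit.VALIDATION_TIMESTAMP)) AS AVERAGE_VISIT_DURATION
FROM @T AS EntryData
CROSS APPLY (
    SELECT TOP (1) t.ID_PLACE, t.VALIDATION_TIMESTAMP
    FROM @T AS t
    WHERE t.ID_TICKET = EntryData.ID_TICKET
      AND t.VALIDATION_TIMESTAMP > EntryData.VALIDATION_TIMESTAMP
    ORDER BY t.VALIDATION_TIMESTAMP ASC
) AS NextVisit
WHERE NextVisit.ID_PLACE IS NOT NULL
GROUP BY EntryData.ID_PLACE, NextVisit.ID_PLACE;
```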
Finally, if you are using SQL Server 2012 you can use LEAD (SQL Fiddle).
That gave me an elapsed time of 249 ms.
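A LEAD-based version might look roughly like this (SQL Server 2012 or later). It is a sketch under stated assumptions: the CTE name `Visits` and the column alias `VISIT_DURATION` are made up, and the window is partitioned by ticket and ordered by timestamp so that LEAD returns the values from the following validation of the same ticket.

```sql
WITH Visits AS
(
    SELECT ID_PLACE,
           -- Place of the next validation for the same ticket (NULL if none).
           LEAD(ID_PLACE) OVER (PARTITION BY ID_TICKET
                                ORDER BY VALIDATION_TIMESTAMP) AS ID_NEXT_VISIT_PLACE,
           -- Minutes until the next validation for the same ticket.
           DATEDIFF(n, VALIDATION_TIMESTAMP,
                    LEAD(VALIDATION_TIMESTAMP) OVER (PARTITION BY ID_TICKET
                                                     ORDER BY VALIDATION_TIMESTAMP)) AS VISIT_DURATION
    FROM TestDuration
)
SELECT ID_PLACE            AS ID_PLACE_IN,
       ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT,
       COUNT(VISIT_DURATION) AS NUMBER_OF_VISITS,
       AVG(VISIT_DURATION)   AS AVERAGE_VISIT_DURATION
FROM Visits
WHERE ID_NEXT_VISIT_PLACE IS NOT NULL
GROUP BY ID_PLACE, ID_NEXT_VISIT_PLACE;
```

One difference to be aware of: the original query looks for a strictly later timestamp, whereas LEAD simply takes the next row in order, so two validations of the same ticket with identical timestamps would be paired here with a duration of 0.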
The LEAD version also performs well without the index. Omitting the optimal index adds an additional SORT to the plan and means it has to read the wider clustered index on my test table, but it still completed in an elapsed time of 293 ms.