I’m developing a query against a table that contains a bunch of points in a time series. The table can grow quite large, and so I want the query to effectively downsample the output by averaging points over fixed time intervals. After writing the query, I’m surprised by how SQL Server (2008) has opted to execute the query. The execution plan reveals an unnecessary sorting operation that would become expensive as the time series grows. Here is the problem, reduced to a simple example:
CREATE TABLE [dbo].[Example]
(
[x] FLOAT NOT NULL,
[y] FLOAT NOT NULL,
PRIMARY KEY CLUSTERED
(
[x] ASC
)
);
SELECT FLOOR([x]), AVG([y])
FROM [dbo].[Example]
GROUP BY FLOOR([x]);
Here I have (x,y) pairs that are already sorted by x (because of the clustered primary key), and I’m averaging y for each whole number x (by truncating with the FLOOR function). I would expect that the table is already suitably sorted for the aggregate since FLOOR is a monotonic function. Unfortunately, SQL Server decides that this data needs to be re-sorted, and here is the execution plan:

Shouldn’t SQL Server be able to perform a streaming aggregation over data grouped by a monotonic function of columns that are already suitably sorted?
Is there a general way to rewrite such queries so that SQL Server will see that the order is preserved?
[Update]
I’ve found an article on the subject, "Things SQL needs: sargability of monotonic functions", and as the title suggests, it seems this is an optimization that SQL Server doesn’t yet do (in most cases).
Here are even simpler queries over [dbo].[Example] that demonstrate the point:
SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY FLOOR([x]); -- sort performed in execution plan
SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY 2*[x]; -- NO sort performed in execution plan
SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY 2*[x]+1; -- sort performed in execution plan
For any single addition or multiplication, the query optimizer understands that the data already has the required order (and the same is seen when you group by such expressions). As soon as two operations are combined, as in 2*[x]+1, the sort comes back. So it seems the optimizer understands the concept of monotonic functions, but does not apply it generally.
I’m testing the computed column / index solution now, but it seems like this will dramatically increase the size of the persisted data since I will need several indices to cover the range of possible intervals.
Some notes:
I think you will have the best query performance if you do something like this:
Add some rows:
Query:
or
both will have the same plan
Plan: no sorting
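The steps above could look roughly like the following. This is a minimal sketch, assuming the approach is the persisted computed column plus index mentioned in the question's update. The column name [x_floor], the index name [IX_Example_x_floor] and the sample rows are hypothetical, chosen only for illustration.

```sql
-- Assumption: a persisted computed column holding FLOOR([x]).
-- FLOOR over FLOAT is deterministic but imprecise, so the column
-- has to be PERSISTED before it can be used as an index key.
ALTER TABLE [dbo].[Example]
    ADD [x_floor] AS FLOOR([x]) PERSISTED;
GO  -- batch separator understood by SSMS / sqlcmd, not a T-SQL statement

-- Hypothetical index: ordered by the bucket, carrying [y] so the
-- aggregate can be answered from the index alone.
CREATE NONCLUSTERED INDEX [IX_Example_x_floor]
    ON [dbo].[Example] ([x_floor])
    INCLUDE ([y]);
GO

-- Add some rows (made-up sample data).
INSERT INTO [dbo].[Example] ([x], [y])
VALUES (0.25, 1.0), (0.75, 3.0), (1.5, 10.0), (2.1, 4.0), (2.9, 6.0);
GO

-- Query the computed column directly ...
SELECT [x_floor], AVG([y])
FROM [dbo].[Example]
GROUP BY [x_floor];

-- ... or keep the original expression; the optimizer should be able
-- to match FLOOR([x]) to the computed column.
SELECT FLOOR([x]), AVG([y])
FROM [dbo].[Example]
GROUP BY FLOOR([x]);
```

With the index in place, the rows arrive already ordered by the grouping expression, which should allow a stream aggregate without a sort. The cost is the one raised in the question: each bucket width needs its own computed column and index.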
Another option: you can create an indexed view. In that case you have to query the view directly, unless you have Enterprise Edition, which can use the view's index even when you query the table directly.
[Edit] Just realized I didn’t explicitly answer your question. You asked why SQL Server would perform a sort if x is the clustered primary key. SQL Server does not sort on x; it sorts on floor(x). In other words, if x is already sorted, then f(x) would not necessarily have the same order, right?
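As a small illustration of that last point, here is a sketch of an expression over the same table whose order really does differ from the order of x. The sample values in the comments are made up. FLOOR itself is non-decreasing, so this shows the general case the optimizer has to guard against, not the specific FLOOR case.

```sql
-- Suppose [x] holds -2, -1 and 1. Clustered-key order is -2, -1, 1,
-- but [x]*[x] evaluates to 4, 1, 1 for those rows, which is not sorted.
-- A sort on the expression is therefore genuinely required here.
SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY [x] * [x];
```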