I have a table schema similar to the following (simplified):
CREATE TABLE Transactions
(
TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
CustomerID int NOT NULL, -- Foreign key, not shown
TransactionDate datetime NOT NULL,
...
)
CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)
To give a bit of background here, this transaction table is actually consolidating several different types of transactions from another vendor’s database (we’ll call it an ETL process), and I therefore don’t have a great deal of control over the order in which they get inserted. Even if I did, transactions may be backdated, so the important thing to note here is that the maximum TransactionID for any given customer is not necessarily the most recent transaction.
In fact, the most recent transaction is a combination of the date and the ID. Dates are not unique – the vendor often truncates the time of day – so to get the most recent transaction, I have to first find the most recent date, and then find the most recent ID for that date.
I know that I can do this with a windowing query (ROW_NUMBER() OVER (PARTITION BY TransactionDate DESC, TransactionID DESC)), but this requires a full index scan and a very expensive sort, and thus fails miserably in terms of efficiency. It’s also pretty awkward to keep writing all the time.
Slightly more efficient is using two CTEs or nested subqueries, one to find the MAX(TransactionDate) per CustomerID, and another to find the MAX(TransactionID). Again, it works, but requires a second aggregate and join, which is slightly better than the ROW_NUMBER() query but still rather painful performance-wise.
I’ve also considered using a CLR User-Defined Aggregate and will fall back on that if necessary, but I’d prefer to find a pure SQL solution if possible to simplify the deployment (there’s no need for SQL-CLR anywhere else in this project).
So the question, specifically is:
Is it possible to write a query that will return the newest TransactionID per CustomerID, defined as the maximum TransactionID for the most recent TransactionDate, and achieve a plan equivalent in performance to an ordinary MAX/GROUP BY query?
(In other words, the only significant steps in the plan should be an index scan and stream aggregate. Multiple scans, sorts, joins, etc. are likely to be too slow.)
Disclaimer: Thinking out loud 🙂
Could you have an indexed, computed column that combines the TransactionDate and TransactionID columns into a form that means finding the latest transaction is just a case of finding the MAX of that single field?