This scenario is based upon a schema in another question and I’m not interested in any discussion about the validity of the schema!
I’m interested to know if there are any good techniques in SQL Server to perform an aggregation of one column (amount1 below) based on the distinct value of another column (id1).
Plan1 below scans table1 twice, performs two aggregations by p_id then joins the result together. It seems as though this could be improved upon. Query 2 could return the wrong result in some circumstances and the plan is worse anyway!
Any ideas?
DDL
IF OBJECT_ID('tempdb..#table1') IS NOT NULL DROP TABLE #table1;
IF OBJECT_ID('tempdb..#table2') IS NOT NULL DROP TABLE #table2;
CREATE TABLE #table1 (id1 int primary key nonclustered, amount1 int, p_id int);
CREATE CLUSTERED INDEX ix ON #table1 (p_id,id1);
INSERT INTO #table1
SELECT 1,500,10 UNION ALL
SELECT 2,700,20 UNION ALL
SELECT 3,500,10 UNION ALL
SELECT 4,450,20 UNION ALL
SELECT 5,300,10;
CREATE TABLE #table2 (id2 int primary key, amount2 int, id1 int);
INSERT INTO #table2
SELECT 1,300,1 UNION ALL
SELECT 2,200,1 UNION ALL
SELECT 3,200,2 UNION ALL
SELECT 4,500,2 UNION ALL
SELECT 5,400,3 UNION ALL
SELECT 6,150,4 UNION ALL
SELECT 7,300,4 UNION ALL
SELECT 8,300,5;
Query 1
WITH t1
AS (SELECT p_id,SUM(amount1) AS total1
FROM #table1
GROUP BY p_id),
t2
AS (SELECT p_id,SUM(amount2) AS total2
FROM #table2 table2
JOIN #table1 table1
ON table1.id1 = table2.id1
GROUP BY p_id)
SELECT t1.p_id,total1,total2
FROM t1
JOIN t2
ON t1.p_id = t2.p_id
Plan 1
Query 2
SELECT table1.p_id,
FLOOR(SUM(DISTINCT amount1 + table1.id1/100000000.0)) AS total1,
SUM(amount2) AS total2
FROM #table1 table1 JOIN #table2 table2 ON table1.id1=table2.id1
GROUP BY table1.p_id


This one will scan each record in either table only once:
, though it’s not necessarily more efficient.
This one:
will do the same with more options for the joins, however, an extra spool may be used in the plan if
t2will be chosen leading.