Given:
Table y
id int clustered index
name nvarchar(25)
Table anothertable
id int clustered index
name nvarchar(25)
Function someFunction
- does some math then returns a valid ID
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) = anotherTable.id
Question:
While timing these two queries I found that on large data sets the first query, using IN, is much faster than the second query, using an INNER JOIN. I do not understand why. Can someone help explain, please?
Generally speaking IN is different from JOIN in that a JOIN can return additional rows where a row has more than one match in the JOIN-ed table.
From your estimated execution plan though it can be seen that in this case the 2 queries are semantically the same
versus
Even if duplicates are introduced by the JOIN then they will be removed by the GROUP BY as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2) will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2) (or AVG or COUNT) then the presence of the duplicates would change the result.
It seems that SQL Server doesn’t have any logic to differentiate between aggregates such as MAX and those such as SUM and so quite possibly it is expanding out all the duplicates then aggregating them later and simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54 for IN vs 28271800 for JOIN but these estimates won’t necessarily be very reliable as the join predicate is unsargable.
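To make the point about duplicates concrete, here is a minimal sketch. The tables A and B, their columns and their data are hypothetical and are not the schema from the question; they only show how IN and JOIN can disagree once a row has more than one match, and why MAX hides the difference while SUM does not.

```sql
-- Hypothetical tables for illustration only (not the asker's schema).
CREATE TABLE A (Col1 int, Col2 int);
CREATE TABLE B (Col1 int);

INSERT INTO A (Col1, Col2) VALUES (1, 10), (2, 20);
-- Three rows of B match A.Col1 = 1; no row matches A.Col1 = 2.
INSERT INTO B (Col1) VALUES (1), (1), (1), (3);

-- IN behaves as a semi join: each row of A is returned at most once.
SELECT A.Col1, MAX(A.Col2) AS MaxCol2, SUM(A.Col2) AS SumCol2
FROM A
WHERE A.Col1 IN (SELECT B.Col1 FROM B)
GROUP BY A.Col1;
-- One row: Col1 = 1, MaxCol2 = 10, SumCol2 = 10

-- JOIN repeats the row of A once per matching row of B.
SELECT A.Col1, MAX(A.Col2) AS MaxCol2, SUM(A.Col2) AS SumCol2
FROM A
JOIN B ON B.Col1 = A.Col1
GROUP BY A.Col1;
-- One row: Col1 = 1, MaxCol2 = 10, SumCol2 = 30
```

Both queries return the same MaxCol2, but the JOIN version has to aggregate three rows to get it, and SumCol2 differs. That is the extra work, and the change in result, described above.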