I understand subqueries are notoriously bad for performance when used incorrectly. I have a very specific scenario where the user needs to retrieve a filtered set of records from a table. A wide variety of filters will be available and they must support composition. Moreover, new filters will be created on a regular basis by a group of developers.
I don’t like the idea of one growing, monolithic SQL query with a gross number of parameters. I don’t like the idea of a bunch of autonomous SQL queries with identical SELECT statements and varying WHERE clauses. I DO like the idea of a dynamic SQL query, but I’m not sure what kind of structure I should use. I can think of 4 basic options: (if there are more that I’m missing, then please don’t hesitate to suggest them)
- “INNER JOIN”: Concatenate filters via INNER JOINS to filter results.
- “FROM subqueries”: Stack filters via subqueries in the FROM clause.
- “WHERE subqueries”: Concatenate filters via subqueries in the WHERE clause.
- “INNER JOIN subqueries”: A weird hybrid.
I’ve created a SQL fiddle to demonstrate (and profile) them:
The below is an excerpt from the fiddle to provide an idea of what I’m talking about:
------------------------------------------------------------------------
--THIS IS AN EXCERPT FROM THE SQL FIDDLE -- IT IS NOT MEANT TO COMPILE--
------------------------------------------------------------------------
--
--"INNER JOIN" test
SELECT COUNT(*)
FROM
@TestTable Test0
INNER JOIN @TestTable Test1 ON Test1.ID=Test0.ID AND Test1.ID % @i = 0
INNER JOIN @TestTable Test2 ON Test2.ID=Test0.ID AND Test2.ID % @j = 0
INNER JOIN @TestTable Test3 ON Test3.ID=Test0.ID AND Test3.ID % @k = 0
--
--"FROM subqueries" test
SELECT COUNT(*) FROM (
    SELECT * FROM (
        SELECT * FROM (
            SELECT * FROM @TestTable Test3 WHERE Test3.ID % @k = 0
        ) Test2 WHERE Test2.ID % @j = 0
    ) Test1 WHERE Test1.ID % @i = 0
) Test0
--
--"WHERE subqueries" test
SELECT COUNT(*)
FROM @TestTable Test0
WHERE
Test0.ID IN (SELECT ID FROM @TestTable Test1 WHERE Test1.ID % @i = 0)
AND Test0.ID IN (SELECT ID FROM @TestTable Test2 WHERE Test2.ID % @j = 0)
AND Test0.ID IN (SELECT ID FROM @TestTable Test3 WHERE Test3.ID % @k = 0)
--
--"INNER JOIN subqueries" test
SELECT COUNT(*)
FROM
TestTable Test0
INNER JOIN (SELECT ID FROM TestTable WHERE ID % @i = 0) Test1 ON Test1.ID=Test0.ID
INNER JOIN (SELECT ID FROM TestTable WHERE ID % @j = 0) Test2 ON Test2.ID=Test0.ID
INNER JOIN (SELECT ID FROM TestTable WHERE ID % @k = 0) Test3 ON Test3.ID=Test0.ID
--
--"EXISTS subqueries" test
SELECT COUNT(*)
FROM TestTable Test0
WHERE
EXISTS (SELECT 1 FROM TestTable Test1 WHERE Test1.ID = Test0.ID AND Test1.ID % @i = 0)
AND EXISTS (SELECT 1 FROM TestTable Test2 WHERE Test2.ID = Test0.ID AND Test2.ID % @j = 0)
AND EXISTS (SELECT 1 FROM TestTable Test3 WHERE Test3.ID = Test0.ID AND Test3.ID % @k = 0)
Rankings (time to execute each test; lower is better)

SQL Fiddle:

| INNER JOIN | FROM SUBQUERIES | WHERE SUBQUERIES | INNER JOIN SUBQUERIES | EXISTS SUBQUERIES |
|------------|-----------------|------------------|-----------------------|-------------------|
|       5174 |             777 |             7240 |                  5478 |              7359 |

Local Environment (no cache: buffer cleared before every test):

| INNER JOIN | FROM SUBQUERIES | WHERE SUBQUERIES | INNER JOIN SUBQUERIES | EXISTS SUBQUERIES |
|------------|-----------------|------------------|-----------------------|-------------------|
|       3281 |            2851 |             2964 |                  3148 |              3071 |

Local Environment (with cache: each query run twice in a row, with the time of the 2nd run recorded):

| INNER JOIN | FROM SUBQUERIES | WHERE SUBQUERIES | INNER JOIN SUBQUERIES | EXISTS SUBQUERIES |
|------------|-----------------|------------------|-----------------------|-------------------|
|        284 |              50 |             3334 |                   278 |               408 |
Each solution has advantages and disadvantages. The subqueries in the WHERE clause have pretty terrible performance (3334 in the cached local run, against 50 to 408 for the other methods). The subqueries in the FROM clause have pretty good performance; in fact, they usually perform the best. (NOTE: I believe this method would negate the benefits of indices?) The INNER JOINs have pretty good performance, though they introduce some interesting scoping issues: unlike the subqueries, the INNER JOINs all operate in the same context, so there would have to be an intermediary system to avoid collisions between table aliases.
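To make the alias-collision point concrete, here is a minimal sketch of what such an intermediary system might look like. It is written in Python against an in-memory SQLite database purely so it can be run. The function name `join_filters`, the `{alias}` placeholder convention, and the `t0`, `t1`, ... alias scheme are all assumptions for illustration, not part of the fiddle.

```python
import sqlite3

def join_filters(base_table, key, filter_templates):
    """Build one query in which every filter is an INNER JOIN with its own alias.

    Each filter is a (template, args) pair. The template refers to its table
    through the literal placeholder {alias}; the builder substitutes a unique
    alias so independently written filters cannot collide.
    NOTE: base_table, key and the templates are pasted into the SQL text, so
    they must come from trusted developer code, never from user input.
    """
    lines = [f"SELECT COUNT(*) FROM {base_table} AS t0"]
    params = []
    for n, (template, args) in enumerate(filter_templates, start=1):
        alias = f"t{n}"  # hypothetical naming scheme: t1, t2, t3, ...
        predicate = template.format(alias=alias)
        lines.append(
            f"INNER JOIN {base_table} AS {alias} "
            f"ON {alias}.{key} = t0.{key} AND {predicate}"
        )
        params.extend(args)  # placeholders appear in the same order as the joins
    return "\n".join(lines), params

# Demo data: IDs 1..60 in a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestTable (ID INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO TestTable (ID) VALUES (?)", [(n,) for n in range(1, 61)])

filters = [("{alias}.ID % ? = 0", [2]),
           ("{alias}.ID % ? = 0", [3]),
           ("{alias}.ID % ? = 0", [5])]
sql, params = join_filters("TestTable", "ID", filters)
count = conn.execute(sql, params).fetchone()[0]
print(count)  # 2 (only 30 and 60 are divisible by 2, 3 and 5)
```

The cost of this design is that every filter author has to write against the placeholder rather than a fixed alias, which is exactly the extra context the FROM-subquery approach avoids.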
Overall I think the cleanest solution is subqueries in the FROM clause. The filters would be easy to write and test (because unlike the INNER JOINs they wouldn’t need to be provided with context / base query).
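As a rough sketch of why the FROM-clause approach composes so easily, the following Python snippet wraps each filter around the previous query as a derived table. It runs against an in-memory SQLite database only for the sake of being runnable. The name `stack_filters` and the `f0`, `f1`, ... aliases are assumptions for illustration, and performance on SQL Server would still have to be measured separately.

```python
import sqlite3

def stack_filters(base_table, filters):
    """Nest each (predicate, args) filter around the query built so far.

    Every filter only sees the columns of the derived table beneath it, so a
    predicate can be written and tested without knowing about the others.
    NOTE: base_table and the predicates are pasted into the SQL text, so they
    must come from trusted developer code; only the values are bound.
    """
    sql = f"SELECT * FROM {base_table}"
    params = []
    for level, (predicate, args) in enumerate(filters):
        alias = f"f{level}"  # hypothetical alias, unique per nesting level
        sql = f"SELECT * FROM ({sql}) AS {alias} WHERE {predicate}"
        # Inner predicates come first in the final text, so appending in
        # application order keeps the positional parameters aligned.
        params.extend(args)
    return sql, params

# Demo data: IDs 1..60 in a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestTable (ID INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO TestTable (ID) VALUES (?)", [(n,) for n in range(1, 61)])

filters = [("ID % ? = 0", [2]), ("ID % ? = 0", [3]), ("ID % ? = 0", [5])]
sql, params = stack_filters("TestTable", filters)
count = conn.execute(f"SELECT COUNT(*) FROM ({sql}) AS result", params).fetchone()[0]
print(count)  # 2 (only 30 and 60 pass all three filters)
```

Because each filter is just a predicate plus its values, a new filter can be unit-tested by stacking it alone on the base table.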
Thoughts? Is this a valid usage of subqueries or a disaster waiting to happen?
UPDATE (2012/10/04):
- Updated SQL Fiddle to include a test for “EXISTS” method
- Added performance test from SQL Fiddle and local environment
If you’re always going to apply “and” logic, the inner join is probably a good approach (I’m generalizing; it will vary with a lot of factors, including your table size and indexes). You will need to use one of the other solutions if you want to be able to mix “and” and “or” filtering.
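One way to picture mixed “and”/“or” filtering is to combine predicates into a single WHERE clause rather than chaining joins. The sketch below does that in Python against an in-memory SQLite database. The helpers `leaf` and `combine` are hypothetical names invented for this example, and the approach is only one possible option.

```python
import sqlite3

def leaf(predicate, *args):
    """A single filter: SQL predicate text plus the values for its placeholders."""
    return predicate, list(args)

def combine(op, *nodes):
    """Join already-built filters with AND or OR, parenthesising each operand."""
    if op not in ("AND", "OR"):
        raise ValueError(f"unsupported operator: {op}")
    sql = f" {op} ".join(f"({node_sql})" for node_sql, _ in nodes)
    # Flatten parameters left to right so they match placeholder order.
    params = [value for _, node_params in nodes for value in node_params]
    return sql, params

# Demo data: IDs 1..30 in a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestTable (ID INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO TestTable (ID) VALUES (?)", [(n,) for n in range(1, 31)])

# (divisible by 2 AND divisible by 3) OR divisible by 5
where_sql, params = combine(
    "OR",
    combine("AND", leaf("ID % ? = 0", 2), leaf("ID % ? = 0", 3)),
    leaf("ID % ? = 0", 5),
)
count = conn.execute(
    f"SELECT COUNT(*) FROM TestTable WHERE {where_sql}", params
).fetchone()[0]
print(count)  # 10 (five multiples of 6, six multiples of 5, with 30 counted once)
```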
Also, you should test out the performance using exists clauses: