Here are two queries that return the same result set. Which one is the better statement, or doesn’t it matter?
SELECT A.id, B.somefield FROM (
SELECT id from table1
UNION
SELECT id from table2
) A LEFT JOIN table3 B on A.id = B.id
or
SELECT A.id, B.somefield FROM table1 A LEFT JOIN table3 B on A.id = B.id
UNION
SELECT A.id, B.somefield FROM table2 A LEFT JOIN table3 B on A.id = B.id
I realise I could pump them full of data and run some tests, but I am just as interested in the ‘why’ if one is faster. (I am using PostgreSQL, in case it influences things.)
Thanks.
OK, first off, id in the select list is ambiguous; do we want A.id or B.id?
Second, assuming id is an indexed field in all tables, de-duping and joining are both NlogM operations, where N is the number of rows on the “left” side and M the number of rows on the “right” side. For each row in N, a matching row in M must be found or not found (when joining, rows found in M are included in the results; when unioning, rows found in M are excluded). This would mean that minimizing the cardinality of the left side will give the greatest performance.
So, the complexity of either query pretty much depends on how many IDs table 1 and table 2 share. With zero commonality (no row IDs in common) and 100 rows per table, the first query will perform one 100log100 union and then a 200log100 join, and the second query will perform two 100log100 joins and then a 100log100 union, so both would execute in equivalent time. However, with 100% commonality (every row in table 1 is also in table 2), the first query will perform a 100log100 union, then a 100log100 join (as the UNION of 1 and 2 would be equivalent to table 1), while the second query will still perform two 100log100 joins and a 100log100 union. As the worst cases are the same but the best case of query 1 costs two-thirds that of query 2, I’d go for query 1.
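If it helps to see the arithmetic, here is a rough Python sketch of the simplified NlogM model used above. The function names (`cost`, `query1_cost`, `query2_cost`) and the `overlap` parameter are made up for illustration, and the model ignores everything a real planner considers (access time, hash vs. merge strategies, statistics), so treat it as back-of-envelope only.

```python
import math


def cost(n, m):
    """Cost of one 'N log M' operation in the simplified model (log base 2)."""
    return n * math.log2(m)


def query1_cost(rows, overlap):
    """Query 1: UNION table1 and table2 first, then join the result to table3.

    rows    -- rows per table (all three tables assumed the same size)
    overlap -- number of ids that table1 and table2 have in common
    """
    union = cost(rows, rows)
    distinct_ids = 2 * rows - overlap  # rows left after de-duping
    join = cost(distinct_ids, rows)
    return union + join


def query2_cost(rows, overlap):
    """Query 2: join each table to table3, then UNION the two joined results.

    In this model the overlap does not change the work done.
    """
    joins = 2 * cost(rows, rows)
    union = cost(rows, rows)
    return joins + union


if __name__ == "__main__":
    for overlap in (0, 50, 100):
        q1 = query1_cost(100, overlap)
        q2 = query2_cost(100, overlap)
        print(f"overlap={overlap:3d}  query1={q1:8.1f}  query2={q2:8.1f}  ratio={q1 / q2:.2f}")
```

The ratio goes from 1 at zero overlap down to two-thirds at full overlap, which is the whole argument for query 1 in a nutshell.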
However, as the commenter said, if you don’t expect any dupes, a UNION ALL will perform better in both queries. The result of a UNION ALL of A and B is simply A+B, with no de-duping step, so its cost is bounded only by the access time of each set (which I haven’t been considering). If you don’t expect dupes, both queries can be cut to the best-case performance of the first query.
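One caveat worth checking before swapping in UNION ALL: it only returns the same rows when the two id sets really are disjoint. The sketch below checks that with Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made purely so it runs anywhere; it says nothing about PostgreSQL's timings). The helper name `run_queries` and the sample data are hypothetical.

```python
import sqlite3


def run_queries(table1_ids, table2_ids, table3_rows):
    """Run query 1 with UNION and with UNION ALL; return both result lists."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE table1 (id INTEGER PRIMARY KEY);
        CREATE TABLE table2 (id INTEGER PRIMARY KEY);
        CREATE TABLE table3 (id INTEGER PRIMARY KEY, somefield TEXT);
        """
    )
    conn.executemany("INSERT INTO table1 VALUES (?)", [(i,) for i in table1_ids])
    conn.executemany("INSERT INTO table2 VALUES (?)", [(i,) for i in table2_ids])
    conn.executemany("INSERT INTO table3 VALUES (?, ?)", table3_rows)

    union_sql = """
        SELECT A.id, B.somefield FROM (
            SELECT id FROM table1
            UNION
            SELECT id FROM table2
        ) A LEFT JOIN table3 B ON A.id = B.id
    """
    # Same statement, but without the de-duping step.
    union_all_sql = union_sql.replace("UNION", "UNION ALL")

    # Sort by id only, so NULL somefield values are never compared.
    with_union = sorted(conn.execute(union_sql).fetchall(), key=lambda r: r[0])
    with_union_all = sorted(conn.execute(union_all_sql).fetchall(), key=lambda r: r[0])
    conn.close()
    return with_union, with_union_all


if __name__ == "__main__":
    # Disjoint ids: both forms return identical rows.
    print(run_queries([1, 2], [3, 4], [(1, "a"), (3, "c")]))
    # Overlapping ids: UNION ALL returns id 2 twice.
    print(run_queries([1, 2], [2, 3], [(2, "b")]))
```

So UNION ALL is a safe swap only if something guarantees the ids never overlap; otherwise you get extra rows, not just a faster query.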