Using Postgres v. 8.4.2.
I have a standard three-table setup for a many-to-many relationship. A query, written against two of these tables, is supposed to return a single record for which there is no existing mapping:
SELECT b.id
FROM scm_branch b
LEFT OUTER JOIN "scm_branchgroup_branches" bgb
ON b.id = bgb.branch_id
WHERE
bgb.branch_id is NULL
LIMIT 1
(scm_branchgroup_branches is the ternary table keeping the mapping records)
this query has the following top level cost:
Limit (cost=0.00..0.22 rows=1 width=4)
However if I change the field used in the WHERE condition to be the primary key of the bgb table, the cost goes drastically up:
SELECT b.id
FROM scm_branch b
LEFT OUTER JOIN "scm_branchgroup_branches" bgb
ON b.id = bgb.branch_id
WHERE
bgb.id IS NULL
LIMIT 1
(note: this query uses “bgb.id IS NULL” vs “bgb.branch_id is NULL” used in the previous example)
Top level cost:
Limit (cost=236366.74..4644900.75 rows=1 width=4)
Why such a drastic difference? I mean I see the difference in execution plans, but I don’t understand the underlying reasons for the difference.
I’ll be happy to provide any additional info needed to answer this question.
Thanks
D.
In the version with
branch_id IS NULL, PostgreSQL recognizes that you’re doing an anti-join, and it optimizes accordingly, using a “hash-join”-type algorithm. (GooglePostgreSQL hash left anti-joinfor lots of information on this optimization.) In the version withid IS NULL, however, it doesn’t recognize this fact, and its optimizations are not nearly so helpful. What’s more, the join basically forces it to use the index onbranch_id, so it can’t take advantage of the primary-key index onid, and must consult the actual table data to determine if a given record satisfies the WHERE-clause.(There may well be other factors at play as well — I’m not an expert on PostgreSQL — but I believe those are the main factors.)