I attend a database course at my school. The teacher gave us a simple exercise. Consider the following simple schema:
Table Book:
Column title (primary key)
Column genre (one of: "romance", "polar", ...)
Table Author:
Column title (foreign key on Book.title)
Column name
Primary key on (title, name)
Among the questions was the following one:
Write the query that returns the authors who have written romance books.
I proposed this answer:
select distinct name
from Author where title in (select title from Book where genre = "romance")
However the teacher said it was wrong, and that the correct answer was:
select distinct name
from Book, Author
where Book.title = Author.title
and genre = "romance"
When I asked for an explanation, all I got was “if you had paid more attention to the course you would know why”. Brilliant.
So, why is my answer incorrect? What exactly is the difference between these queries? What exactly do they do, on the DB engine level?
Your answer is correct.
My guess is that the teacher marked it as wrong because he/she wanted you to practise joins with that question. But if that was the intention, it should have been stated in the question.
Technically, the two queries are indeed different: a DBMS with a simple query optimizer will evaluate the sub-select differently from the join in your teacher’s answer.
I wouldn’t be surprised if a DBMS with a good optimizer came up with the same execution plan for both queries.
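To check that both statements return the same authors, here is a minimal sketch using Python's built-in sqlite3 module as a throwaway harness. The sample rows (Alice, Bob, Carol and the three books) are made up for illustration, and SQLite is only an assumption for convenience, not one of the systems tested in this answer. The string literal uses single quotes, which is the standard SQL form.

```python
import sqlite3

# In-memory database with the schema from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Book (
        title TEXT PRIMARY KEY,
        genre TEXT NOT NULL
    );
    CREATE TABLE Author (
        title TEXT NOT NULL REFERENCES Book(title),
        name  TEXT NOT NULL,
        PRIMARY KEY (title, name)
    );
    -- Hypothetical sample data: Alice wrote two romance books,
    -- Bob co-wrote one of them, Carol wrote no romance.
    INSERT INTO Book VALUES
        ('Book A', 'romance'),
        ('Book B', 'romance'),
        ('Book C', 'polar');
    INSERT INTO Author VALUES
        ('Book A', 'Alice'),
        ('Book B', 'Alice'),
        ('Book B', 'Bob'),
        ('Book C', 'Carol');
""")

# The sub-select version from the question.
subselect_sql = """
    SELECT DISTINCT name
    FROM Author
    WHERE title IN (SELECT title FROM Book WHERE genre = 'romance')
"""

# The join version from the teacher.
join_sql = """
    SELECT DISTINCT name
    FROM Book, Author
    WHERE Book.title = Author.title
      AND genre = 'romance'
"""

subselect_names = sorted(row[0] for row in conn.execute(subselect_sql))
join_names = sorted(row[0] for row in conn.execute(join_sql))

print(subselect_names)  # ['Alice', 'Bob']
print(join_names)       # ['Alice', 'Bob']
```

Because Book.title is the primary key, each Author row matches at most one Book row, so the join cannot produce rows the sub-select would miss. The distinct removes the duplicate for an author with several romance books in both versions.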
Edit
I created some test data with 50000 books, 50000 authors and 7 different genres (smaller numbers don’t really make sense, as the optimizers tend to simply grab the whole table then). The statement would return 7144 rows.
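A comparable experiment can be sketched with Python's built-in sqlite3 module. This is not the script used for the tests in this answer: the data layout (one author row per book, genres assigned round-robin), the five placeholder genre names and the use of SQLite are all assumptions, so the row count is not meant to reproduce the 7144 above, and the printed plans depend on the SQLite version.

```python
import sqlite3

# "romance" and "polar" come from the question; the rest are placeholders.
GENRES = ["romance", "polar", "genre3", "genre4", "genre5", "genre6", "genre7"]
N = 50_000  # number of books and of author rows (assumed one author per book)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Book (
        title TEXT PRIMARY KEY,
        genre TEXT NOT NULL
    );
    CREATE TABLE Author (
        title TEXT NOT NULL REFERENCES Book(title),
        name  TEXT NOT NULL,
        PRIMARY KEY (title, name)
    );
""")

# Assign genres round-robin so each genre gets roughly N / 7 books.
conn.executemany(
    "INSERT INTO Book VALUES (?, ?)",
    ((f"title{i}", GENRES[i % len(GENRES)]) for i in range(N)),
)
conn.executemany(
    "INSERT INTO Author VALUES (?, ?)",
    ((f"title{i}", f"author{i}") for i in range(N)),
)

queries = {
    "subselect": """
        SELECT DISTINCT name
        FROM Author
        WHERE title IN (SELECT title FROM Book WHERE genre = 'romance')
    """,
    "join": """
        SELECT DISTINCT name
        FROM Book, Author
        WHERE Book.title = Author.title
          AND genre = 'romance'
    """,
}

results = {}
for label, sql in queries.items():
    results[label] = {row[0] for row in conn.execute(sql)}
    print(label, "returns", len(results[label]), "rows")
    # EXPLAIN QUERY PLAN is SQLite's way of showing the chosen strategy;
    # the last column of each row holds the human-readable step.
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", step[-1])

print("same result:", results["subselect"] == results["join"])
```

On PostgreSQL the equivalent check is to prefix each statement with EXPLAIN, which is how the plans linked below were obtained.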
PostgreSQL
The execution plans are nearly identical, differing only slightly in the “join” method.
Here is the plan for the sub-select version: http://explain.depesz.com/s/eov
Here is the plan for the join version: http://explain.depesz.com/s/aTI
Surprisingly, the join version has a slightly higher cost value.
Oracle
Both plans are 100% identical:
Looking at the statistics when using autotrace, there is also no difference whatsoever. I didn’t bother to actually create a trace file to analyze it, as I don’t expect to see a difference there.
Things don’t really change if an index on book.genre is added. Oracle sticks with the full table scan (even with 100000 rows), probably because the tables are not very wide and a lot of rows fit on a single page.
PostgreSQL does use the index for both statements, but there is still no real difference between the plans.