We have two Tables:
- Document: id, title, document_type_id, showon_id
- DocumentType: id, name
- Relationship: DocumentType hasMany Documents. (Document.document_type_id = DocumentType.id)
We wish to retrieve a list of all document types for one given ShowOn_Id.
We see two possiblities:
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN ( SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42 ); SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN ( SELECT Document.document_type_id FROM Document WHERE showon_id = 42 );
Our question is: when and if is it better to use the DISTINCT to get the smaller record set versus retrieving the whole table and the IN statement walking the table to the first match. (We guess that’s what it does ;-))
Is this different for different databases, is there a common answer?
Or is there a better way of doing it? (We are in .NET land)
From my point of view it should not make any difference inside SQL Server (but who knows how this is implemented).
Think of it this way: to return the resultset the server needs to go into the Document table and retrieve all document_type_id WHERE showon_id = 42. In the process of retrieving the document_type_ids (e.g. by index seeking) it puts them into a hash table. When this process has finished the hash table will contain distinct values anyway. After that the query execution goes inside the Document_Type table, scans the primary key and probes into the hash table. Note that this depends, e.g. maybe it’s more efficient to not use a hash table, when the expected row count from the Document table it low compared to Document_Type, but in general you get the same query plan as for the query wmasm just suggested.