I know this has probably been asked before, but I can’t find it with SO’s search.
Let’s say I have TABLE1 and TABLE2. How should I expect the performance of a query such as this:
SELECT * FROM TABLE1 WHERE id IN SUBQUERY_ON_TABLE2;
to degrade as the number of rows in TABLE1 and TABLE2 grows, given that id is the primary key of TABLE1?
Yes, I know using IN is such a n00b mistake, but TABLE2 has a generic relation (a Django generic relation) to multiple other tables, so I can’t think of another way to filter the data. At what (approximate) number of rows in TABLE1 and TABLE2 should I expect to notice performance issues because of this? Will performance degrade linearly, exponentially, or in some other way as the number of rows grows?
When the number of records returned by the subquery is small, and the resulting number of rows returned by the main query is also small, you’ll just get fast index lookups on each. As the percentage of the data returned increases, eventually each of the two will switch to using a sequential scan instead of an indexed one, to grab the whole table in one gulp rather than piece it together. It isn’t a simple fall-off in performance that’s either linear or exponential; there are major discontinuities as the type of plan changes. And the number of rows at which those happen depends on the size of the tables, so no useful rules of thumb for you there either. You should build a simulation like I’m doing below and see what happens on your own data set to get an idea what the curve looks like.
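A rough harness for that kind of simulation could look like the sketch below. It is only an illustration under stated assumptions: the tables `table1` and `table2`, the column `object_id`, and the helpers `build_tables` and `measure` are all hypothetical, and it uses an in-memory SQLite database as a stand-in so that it runs anywhere. SQLite's planner is not PostgreSQL's, so the plans and the points where they change will differ; the part worth copying is the method of running the same IN query at increasing subquery sizes and recording both the plan and the timing.

```python
import sqlite3
import time

# Hypothetical query shape: table1.id is the primary key, and the subquery
# on table2 returns a controllable number of ids.
QUERY = (
    "SELECT * FROM table1 "
    "WHERE id IN (SELECT object_id FROM table2 WHERE id <= ?)"
)


def build_tables(conn, n_main=20000, n_ref=20000):
    """Create and fill two made-up tables for the experiment."""
    conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE table2 (id INTEGER PRIMARY KEY, object_id INTEGER)")
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?)",
        ((i, "row %d" % i) for i in range(1, n_main + 1)),
    )
    # Each table2 row points at the table1 row with the same number.
    conn.executemany(
        "INSERT INTO table2 VALUES (?, ?)",
        ((i, i) for i in range(1, n_ref + 1)),
    )
    conn.commit()


def measure(conn, subquery_rows):
    """Return (rows returned, elapsed seconds, plan lines) for one run."""
    # The last column of EXPLAIN QUERY PLAN output is the readable detail.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY,
                                            (subquery_rows,))]
    start = time.perf_counter()
    rows = conn.execute(QUERY, (subquery_rows,)).fetchall()
    elapsed = time.perf_counter() - start
    return len(rows), elapsed, plan


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_tables(conn)
    for n in (10, 100, 1000, 10000):
        count, elapsed, plan = measure(conn, n)
        print(n, count, "%.2f ms" % (elapsed * 1000), plan)
```

On PostgreSQL itself, the equivalent step is to prefix the real query with `EXPLAIN ANALYZE`, which prints the chosen plan together with actual run times, and to repeat it while varying how many rows the subquery returns.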
Here’s an example of how that works using a PostgreSQL 9.0 database loaded with the Dell Store 2 database. Once there are 1,000 rows being returned by the subquery, it’s doing a full table scan of the main table. And once the subquery is considering 10,000 records, the subquery itself turns into a full table scan too. These were each run twice, so you’re seeing the cached performance. How performance changes based on cached vs. uncached status is a whole ’nother topic altogether: