This recent question had me thinking about optimizing a category filter.
Suppose we wish to create a database referencing a huge number of audio tracks, with their release date and a list of world locations from which the audio track is downloadable.
The requests we wish to optimize for are:
- Give me the 10 most recent tracks downloadable from location A.
- Give me the 10 most recent tracks downloadable from locations A or B.
- Give me the 10 most recent tracks downloadable from locations A and B.
How would one go about structuring that database? I have a hard time coming up with a simple solution that doesn’t require reading through all the tracks for at least one location…
To optimise these queries, you need to slightly de-normalise the data.
For example, you may have a track table that contains the track’s id, name and release date, and a map_location_to_track table that describes where those tracks can be downloaded from. To answer “10 most recent tracks for location A” you need to get ALL of the tracks for location A from map_location_to_track, then join them to the track table to order them by release date, and pick the top 10.
If instead all the data is in a single table, the ordering step can be avoided. For example…
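As a rough sketch of such a single table (not a definitive schema): the column names location_id, release_date and track_id, the sample data, and the use of SQLite through Python's sqlite3 module are all assumptions made for illustration. SQLite spells "top 10" as LIMIT 10; other RDBMS use TOP or FETCH FIRST.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
# Assumed layout: release_date is copied into the mapping table so that the
# primary key itself is ordered by (location_id, release_date, track_id).
conn.execute("""
    CREATE TABLE map_location_to_track (
        location_id  INTEGER NOT NULL,
        release_date TEXT    NOT NULL,  -- ISO 8601, so text order = date order
        track_id     INTEGER NOT NULL,
        PRIMARY KEY (location_id, release_date, track_id)
    ) WITHOUT ROWID
""")

# Hypothetical sample data: track N is released N days after 2020-01-01.
# Location 1 ("A") carries the even tracks, location 2 ("B") every third track.
day0 = date(2020, 1, 1)
conn.executemany(
    "INSERT INTO map_location_to_track VALUES (?, ?, ?)",
    [(loc, (day0 + timedelta(days=t)).isoformat(), t)
     for t in range(1, 101)
     for loc, step in ((1, 2), (2, 3))
     if t % step == 0],
)

# 10 most recent tracks downloadable from location A (location_id = 1).
top10_a = conn.execute("""
    SELECT track_id, release_date
      FROM map_location_to_track
     WHERE location_id = ?
     ORDER BY release_date DESC
     LIMIT 10
""", (1,)).fetchall()

print([track_id for track_id, _ in top10_a])
```

Whether the engine really walks the key backwards instead of sorting depends on the RDBMS, so it is worth checking the query plan on your own system.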
Having location_id as the first entry in the primary key ensures that the WHERE clause is simply an index seek. There is then no need to re-order the data: it’s already ordered for us by the primary key, so the RDBMS can just pick the 10 records at the end.
You may indeed still join on to the track table to get the name, price, etc, but you now only have to do that for 10 records, not everything at that location.
To solve the same query for “locations A OR B”, there are a couple of options that can perform differently depending on the RDBMS you are using.
The first is simple, though some RDBMS don’t play nice with IN…
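A minimal sketch of the IN variant, under the same assumptions as before (hypothetical column names and sample data, SQLite via Python's sqlite3, LIMIT standing in for TOP). The GROUP BY is my addition so that a track available in both locations is only listed once.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE map_location_to_track (
        location_id  INTEGER NOT NULL,
        release_date TEXT    NOT NULL,
        track_id     INTEGER NOT NULL,
        PRIMARY KEY (location_id, release_date, track_id)
    ) WITHOUT ROWID
""")

# Hypothetical sample data: track N is released N days after 2020-01-01.
# Location 1 ("A") carries the even tracks, location 2 ("B") every third track.
day0 = date(2020, 1, 1)
conn.executemany(
    "INSERT INTO map_location_to_track VALUES (?, ?, ?)",
    [(loc, (day0 + timedelta(days=t)).isoformat(), t)
     for t in range(1, 101)
     for loc, step in ((1, 2), (2, 3))
     if t % step == 0],
)

# 10 most recent tracks downloadable from location A or B, using IN.
top10_in = conn.execute("""
    SELECT track_id, release_date
      FROM map_location_to_track
     WHERE location_id IN (?, ?)
     GROUP BY release_date, track_id   -- collapse tracks present in both
     ORDER BY release_date DESC
     LIMIT 10
""", (1, 2)).fetchall()

print([track_id for track_id, _ in top10_in])
```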
The next option is nearly identical, but still some RDBMS don’t play nice with OR logic being applied to INDEXes.
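The OR variant could look like this; again only a sketch with assumed column names, made-up sample data and SQLite syntax. Only the WHERE clause differs from an IN-based query.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE map_location_to_track (
        location_id  INTEGER NOT NULL,
        release_date TEXT    NOT NULL,
        track_id     INTEGER NOT NULL,
        PRIMARY KEY (location_id, release_date, track_id)
    ) WITHOUT ROWID
""")

# Hypothetical sample data: track N is released N days after 2020-01-01.
# Location 1 ("A") carries the even tracks, location 2 ("B") every third track.
day0 = date(2020, 1, 1)
conn.executemany(
    "INSERT INTO map_location_to_track VALUES (?, ?, ?)",
    [(loc, (day0 + timedelta(days=t)).isoformat(), t)
     for t in range(1, 101)
     for loc, step in ((1, 2), (2, 3))
     if t % step == 0],
)

# 10 most recent tracks downloadable from location A or B, using OR.
top10_or = conn.execute("""
    SELECT track_id, release_date
      FROM map_location_to_track
     WHERE location_id = ? OR location_id = ?
     GROUP BY release_date, track_id   -- collapse tracks present in both
     ORDER BY release_date DESC
     LIMIT 10
""", (1, 2)).fetchall()

print([track_id for track_id, _ in top10_or])
```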
In either case, the algorithm being used to rationalise the list of records down to 10 is hidden from you. It’s a matter of try it and see; the index is still available such that this CAN be performant.
An alternative is to explicitly determine part of the approach in your SQL statement…
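One way to spell that out, as a sketch: take the top 10 of each location separately, union them, and take the top 10 of the result. Column names and sample data are hypothetical, and the syntax is SQLite's (each limited query is wrapped in a sub-select because SQLite does not accept ORDER BY / LIMIT directly inside a UNION member).

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE map_location_to_track (
        location_id  INTEGER NOT NULL,
        release_date TEXT    NOT NULL,
        track_id     INTEGER NOT NULL,
        PRIMARY KEY (location_id, release_date, track_id)
    ) WITHOUT ROWID
""")

# Hypothetical sample data: track N is released N days after 2020-01-01.
# Location 1 ("A") carries the even tracks, location 2 ("B") every third track.
day0 = date(2020, 1, 1)
conn.executemany(
    "INSERT INTO map_location_to_track VALUES (?, ?, ?)",
    [(loc, (day0 + timedelta(days=t)).isoformat(), t)
     for t in range(1, 101)
     for loc, step in ((1, 2), (2, 3))
     if t % step == 0],
)

# Top 10 per location, merged; UNION (not UNION ALL) drops tracks that
# appear in both lists, and the final ORDER BY / LIMIT applies to the merge.
top10_union = conn.execute("""
    SELECT track_id, release_date
      FROM (SELECT track_id, release_date
              FROM map_location_to_track
             WHERE location_id = ?
             ORDER BY release_date DESC
             LIMIT 10)
    UNION
    SELECT track_id, release_date
      FROM (SELECT track_id, release_date
              FROM map_location_to_track
             WHERE location_id = ?
             ORDER BY release_date DESC
             LIMIT 10)
     ORDER BY release_date DESC
     LIMIT 10
""", (1, 2)).fetchall()

print([track_id for track_id, _ in top10_union])
```

The outer sort never sees more than 20 rows, however large each location is.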
It is still possible for an optimiser to realise that these two unioned data sets are ordered, and so make the external order by very quick. Even if not, however, ordering 20 items is pretty quick. More importantly, it’s a fixed overhead: it doesn’t matter if you have a billion tracks in each location, we’re just merging two lists of 10.
The hardest to optimise is the AND condition, but even then the existence of the “TOP 10” constraint can help work wonders.
Adding a HAVING clause to the IN or OR based approaches can solve this, but, again, depending on your RDBMS, may run less than optimally.
The alternative is to try the “two queries” approach…
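Both AND variants can be sketched side by side. This assumes the same hypothetical column names and sample data as before, SQLite syntax, and that a track has the same release date in every location. The HAVING version counts how many of the two locations carry each track; the "two queries" version joins one sub-query per location.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE map_location_to_track (
        location_id  INTEGER NOT NULL,
        release_date TEXT    NOT NULL,
        track_id     INTEGER NOT NULL,
        PRIMARY KEY (location_id, release_date, track_id)
    ) WITHOUT ROWID
""")

# Hypothetical sample data: track N is released N days after 2020-01-01.
# Location 1 ("A") carries the even tracks, location 2 ("B") every third track.
day0 = date(2020, 1, 1)
conn.executemany(
    "INSERT INTO map_location_to_track VALUES (?, ?, ?)",
    [(loc, (day0 + timedelta(days=t)).isoformat(), t)
     for t in range(1, 101)
     for loc, step in ((1, 2), (2, 3))
     if t % step == 0],
)

# Variant 1: IN + HAVING. The primary key guarantees at most one row per
# (location, track), so COUNT(*) = 2 means "present in both locations".
via_having = conn.execute("""
    SELECT track_id, release_date
      FROM map_location_to_track
     WHERE location_id IN (?, ?)
     GROUP BY release_date, track_id
    HAVING COUNT(*) = 2
     ORDER BY release_date DESC
     LIMIT 10
""", (1, 2)).fetchall()

# Variant 2: one sub-query per location, joined on release_date and track_id.
via_join = conn.execute("""
    SELECT a.track_id, a.release_date
      FROM (SELECT track_id, release_date
              FROM map_location_to_track
             WHERE location_id = ?) AS a
      JOIN (SELECT track_id, release_date
              FROM map_location_to_track
             WHERE location_id = ?) AS b
        ON b.release_date = a.release_date
       AND b.track_id     = a.track_id
     ORDER BY a.release_date DESC
     LIMIT 10
""", (1, 2)).fetchall()

print([track_id for track_id, _ in via_join])
```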
This time we can’t restrict the two sub-queries to just 10 records; for all we know the most recent 10 in location A don’t appear in location B at all. The primary key rescues us again though. Because the two data sets are organised by release date, the RDBMS can just start at the top record of each set and merge the two until it has 10 records, then stop.
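To make that merge concrete, here is a small Python sketch of the idea. It illustrates the principle only and is not what any particular RDBMS does internally; the function name merge_intersect and the integer "release days" are made up for the example.

```python
from itertools import islice

def merge_intersect(a, b):
    """Yield the keys present in both inputs.

    Both inputs must be sorted in descending order (newest first), as the
    rows for one location would be when read backwards along the key.
    """
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            yield x                      # available in both locations
            x, y = next(a, None), next(b, None)
        elif x > y:
            x = next(a, None)            # x is newer than anything left in b
        else:
            y = next(b, None)            # y is newer than anything left in a

# Hypothetical data: a key is a release day number; location A has the even
# days, location B every third day. The ranges are lazy, so nothing is
# materialised up front.
location_a = range(1_000_000, 0, -2)
location_b = range(999_999, 0, -3)

# islice stops pulling from the generator once 10 matches have been found.
top10_both = list(islice(merge_intersect(location_a, location_b), 10))
print(top10_both)
```

Only the first few dozen keys of each input are ever read here, regardless of how long the two lists are.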
NOTE: Because the release_date is in the primary key, and before the track_id, one should ensure that it is used in the join.
Depending on the RDBMS, you don’t even need the sub-queries. You may be able to just self-join the table without altering the RDBMS’ plan…
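A self-join version might look like this. It is a sketch under the same assumptions (hypothetical column names and sample data, SQLite syntax); whether the plan stays the same as with sub-queries is something to verify on your own RDBMS.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE map_location_to_track (
        location_id  INTEGER NOT NULL,
        release_date TEXT    NOT NULL,
        track_id     INTEGER NOT NULL,
        PRIMARY KEY (location_id, release_date, track_id)
    ) WITHOUT ROWID
""")

# Hypothetical sample data: track N is released N days after 2020-01-01.
# Location 1 ("A") carries the even tracks, location 2 ("B") every third track.
day0 = date(2020, 1, 1)
conn.executemany(
    "INSERT INTO map_location_to_track VALUES (?, ?, ?)",
    [(loc, (day0 + timedelta(days=t)).isoformat(), t)
     for t in range(1, 101)
     for loc, step in ((1, 2), (2, 3))
     if t % step == 0],
)

# Self-join: release_date is part of the join condition so that the full
# primary key (location_id, release_date, track_id) can be used on side b.
top10_and = conn.execute("""
    SELECT a.track_id, a.release_date
      FROM map_location_to_track AS a
      JOIN map_location_to_track AS b
        ON b.release_date = a.release_date
       AND b.track_id     = a.track_id
     WHERE a.location_id = ?
       AND b.location_id = ?
     ORDER BY a.release_date DESC
     LIMIT 10
""", (1, 2)).fetchall()

print([track_id for track_id, _ in top10_and])
```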
All in all, the combination of three things makes this pretty efficient:
- Partially de-normalising the data to ensure it’s in a friendly order for our needs
- Knowing we only ever need the first 10 results
- Knowing we’re only ever dealing with 2 locations at the most
There are variations that can optimise to any number of records and any number of locations, but these are significantly less performant than the solutions to the problem as stated in this question.