I’m trying to determine the best general approach for querying two large joined tables, where each table has a column in the WHERE clause. Imagine a simple schema w/ two tables:
posts
id (int)
blog_id (int)
published_date (datetime)
title (varchar)
body (text)
posts_tags
post_id (int)
tag_id (int)
With the following indexes:
posts: [blog_id, published_date]
posts_tags: [tag_id, post_id]
We want to SELECT the 10 most recent posts on a given blog that were tagged with “foo”. For the sake of this discussion, assume the blog has 10 million posts, and 1 million of those have been tagged with “foo”. What is the most efficient way to query for this data?
The naive approach would be to do this:
SELECT
    id, blog_id, published_date, title, body
FROM
    posts p
INNER JOIN
    posts_tags pt
    ON pt.post_id = p.id
WHERE
    p.blog_id = 1
    AND pt.tag_id = 1
ORDER BY
    p.published_date DESC
LIMIT 10
MySQL will use our indexes, but will still end up scanning millions of records. Is there a more efficient way to retrieve this data w/o denormalizing the schema?
Most likely MySQL will first use the index (blog_id, published_date) to scan the rows satisfying the condition blog_id = 1, starting with the row with the newest published_date. To do this it just needs to scan backwards through the index starting from the right place. For each row it must join to the posts_tags table. At this point both the tag_id and the post_id are known, so it is just a lookup in the primary index to see if the row exists. 10% of the rows have the tag foo, so on average about 100 rows in the posts table will have to be checked before the first 10 rows of the result set are found.
I would expect the query you posted to run quite quickly if the tag foo is common. I don’t think it will check millions of rows; perhaps a few hundred, or a few thousand if you are unlucky. As soon as it has found 10 matching rows it can stop without checking any more rows.
On the other hand, if you choose a tag that has fewer than 10 occurrences it will be slow, as it will have to scan all the rows in that blog.
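To make the early-exit argument concrete, here is a rough simulation of that access pattern in plain Python. It is only a sketch and not how MySQL is implemented: the function name scan_newest_first, the 100,000-post blog, the evenly spaced tagging, and the rare-tag ids are all made-up assumptions for illustration. A set membership test stands in for the posts_tags index lookup.

```python
def scan_newest_first(post_ids_newest_first, tagged_post_ids, limit=10):
    """Walk posts from newest to oldest, probe the tag set for each one,
    and stop as soon as `limit` matches are found (ORDER BY ... LIMIT)."""
    matches = []
    rows_checked = 0
    for post_id in post_ids_newest_first:
        rows_checked += 1
        if post_id in tagged_post_ids:  # stands in for the posts_tags lookup
            matches.append(post_id)
            if len(matches) == limit:
                break  # early exit: no need to look at older posts
    return matches, rows_checked


# Hypothetical blog: 100,000 posts, where a higher id means a newer post.
TOTAL_POSTS = 100_000
newest_first = range(TOTAL_POSTS, 0, -1)

# Common tag (assumption): every 10th post is tagged, i.e. 10% of the blog.
common_tag = {i for i in range(1, TOTAL_POSTS + 1) if i % 10 == 1}
# Rare tag (assumption): only 3 tagged posts, fewer than the LIMIT.
rare_tag = {5, 50_000, 99_999}

common_matches, common_checked = scan_newest_first(newest_first, common_tag)
rare_matches, rare_checked = scan_newest_first(newest_first, rare_tag)

print("common tag:", len(common_matches), "matches after", common_checked, "rows")
print("rare tag:", len(rare_matches), "matches after", rare_checked, "rows")
```

With the evenly spaced common tag, the scan stops after about limit / 0.10 rows. The rare tag never reaches the limit, so every post on the blog is visited. Real data will be less regular, so the count for a common tag will vary around that figure.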
Do you have performance measurements that show the query is particularly slow even when the tag you are searching for occurs often? Can you post the output of EXPLAIN for the query?