I have 2 large mysql tables: Articles and ArticleTopics. I want to query the DB and retrieve the last 30 articles published for a given topicID. My current query is rather slow. Any ideas on how to improve it?
More details:
The tables:
Articles (~1 million rows)
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| articleId | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(255) | NO | | NULL | |
| content | longtext | NO | | NULL | |
| pubDate | datetime | NO | MUL | NULL | |
+-----------+--------------+------+-----+---------+----------------+
ArticleTopics (~10 million rows)
+-----------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| articleId | int(11) | NO | MUL | NULL | |
| topicId | int(11) | NO | MUL | NULL | |
+-----------+--------------+------+-----+---------+-------+
And my query:
SELECT a.articleId, a.pubDate
FROM Articles a, ArticleTopics t
WHERE t.articleId=a.articleId AND t.topicId=3364
ORDER BY a.pubDate DESC LIMIT 30;
And the EXPLAIN of the query:
+----+-------------+-------+--------+-------------------------------------+-------------------+---------+-------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------+-------------------+---------+-------------------+------+----------------------------------------------+
| 1 | SIMPLE | t | ref | articleId,topicId,topicId_articleId | topicId_articleId | 4 | const | 4281 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY,articleId_pubDate | PRIMARY | 4 | t.articleId | 1 | |
+----+-------------+-------+--------+-------------------------------------+-------------------+---------+-------------------+------+----------------------------------------------+
The slowness, I believe, is coming from the ORDER BY a.pubDate DESC. I can greatly improve performance by faking it a bit by instead doing an ORDER BY t.articleId DESC and having an index in ArticleTopics on both articleId & topicId, since in general, the articleIds are in the same order as pubDates. They are not always, however, so it’s not ideal. I’d like to be able to sort it on the pubDate.
Update: Added EXPLAIN.
You can rewrite the query in various ways to see if it speeds things up:
Or:
The important index for both queries is on Articles, and contains articleId as first field.
If article is a large table, with say the entire PDF in binary, you can create an index that fully covers the query. Full coverage means all selected fields are part of the index. For this query, a fully covering index would be (articleId, pubDate).