I have a strange performance problem with a query used to build a “filter by tags” widget for a Delicious-like bookmarking webapp. This specific, relatively complex query runs much faster (1,000 to 10,000 times) when split into a few separate queries.
I’ve tested it in the following environments:
- Windows XP / MySQL 5.1.37 (server & client)
- Ubuntu 11.10 / MySQL 5.1.58 (server & client)
The problem didn’t show up in the small development database. I caught it in production, after a large increase in the number of records (currently about 100K rows in the link_tags table and 11K unique tags).
I use the following DB schema:
CREATE TABLE IF NOT EXISTS `link_tags` (
`link_id` int(11) NOT NULL,
`tag_id` int(11) NOT NULL,
UNIQUE KEY `link_tag_id` (`link_id`,`tag_id`),
KEY `tag_id` (`tag_id`),
KEY `link_id` (`link_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
CREATE TABLE IF NOT EXISTS `tags` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tag` varchar(255) COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `tag` (`tag`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
The schema is straightforward (see also http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html), so it shouldn’t require further explanation.
Technically speaking, the problematic query (below) retrieves the tags related to a given set of tags (specifically, all tags attached to links that carry every tag in the specified set) and counts, for each tag found, the number of links that carry both that tag and the whole filter set.
[ORIGINAL QUERY]
SELECT COUNT(*) AS link_count, tag FROM (
    SELECT
        t.tag AS tag,
        CONCAT(lt.tag_id, ':', lt.link_id) AS tag_link_hash
    FROM
        link_tags lt, tags t
    WHERE
        t.id = lt.tag_id
        AND lt.link_id IN (
            SELECT
                link_id
            FROM
                link_tags lt2, links l2
            WHERE
                l2.id = lt2.link_id
                AND l2.created_by = ?  -- user to filter tags for
                AND lt2.tag_id IN (
                    SELECT id FROM tags t2 WHERE tag IN (?)  -- tag set to filter by
                )
            GROUP BY
                link_id
            HAVING
                COUNT(*) = ?)  -- number of tags in filter
    GROUP BY
        tag_link_hash) tmp
GROUP BY
    tag
ORDER BY
    link_count DESC,
    tag ASC
[Results in X minutes - up to 4 hours]
In the production database (as I mentioned, about 100K link_tags rows and 11K tags) the query runs for minutes to hours, depending on how frequently the specified tags occur.
Strangely, everything runs smoothly if I split it into a few queries:
1) Find ids for given tag names.
[REPLACEMENT QUERY 1]
SELECT id FROM tags t2 WHERE tag IN (?)
[Results in 0,0011 seconds]
2) Find all link_ids for given set of tags (intersection!).
[REPLACEMENT QUERY 2]
SELECT
    link_id
FROM
    link_tags lt2, links l2
WHERE
    l2.id = lt2.link_id
    AND l2.created_by = 1
    AND lt2.tag_id IN ( ? )  -- here goes the imploded result of query 1
GROUP BY
    link_id
HAVING
    COUNT(*) = ?  -- number of tags
[Results in 0,0996 seconds]
3) Find all tags for given set of link_ids and group tags by count of links.
[REPLACEMENT QUERY 3]
SELECT COUNT(*) AS link_count, tag FROM (
    SELECT
        t.tag AS tag,
        CONCAT(lt.tag_id, ':', lt.link_id) AS tag_link_hash
    FROM
        link_tags lt, tags t
    WHERE
        t.id = lt.tag_id
        AND lt.link_id IN ( ? )  -- here goes the imploded result of query 2
    GROUP BY
        tag_link_hash) tmp
GROUP BY
    tag
ORDER BY
    link_count DESC,
    tag ASC
[Results in 0,0543 seconds]
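As a rough illustration of how the three steps chain together in application code, here is a minimal Python sketch. Several things in it are assumptions rather than part of the setup described above: it uses Python's sqlite3 with an in-memory database as a stand-in for MySQL, the links table is reduced to the two columns the queries reference (id, created_by), the sample rows are invented, and the helper names (make_demo_db, marks, related_tags_in_steps) are hypothetical. Query 3 is simplified: the CONCAT hash is left out, because the unique key on (link_id, tag_id) already makes every pair distinct.

```python
import sqlite3


def make_demo_db():
    """Tiny in-memory stand-in for the production schema (SQLite, not MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE links (id INTEGER PRIMARY KEY, created_by INTEGER NOT NULL);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, tag TEXT NOT NULL UNIQUE);
        CREATE TABLE link_tags (
            link_id INTEGER NOT NULL,
            tag_id  INTEGER NOT NULL,
            UNIQUE (link_id, tag_id)
        );
        -- Invented sample data: links 1-3 belong to user 1, link 4 to user 2.
        INSERT INTO links VALUES (1, 1), (2, 1), (3, 1), (4, 2);
        INSERT INTO tags VALUES (1, 'mysql'), (2, 'php'), (3, 'sql'), (4, 'web');
        INSERT INTO link_tags VALUES
            (1, 1), (1, 2), (1, 3),
            (2, 1), (2, 3), (2, 4),
            (3, 1),
            (4, 1), (4, 3), (4, 4);
    """)
    return conn


def marks(values):
    """Build a '?,?,?' placeholder list for an IN (...) clause."""
    return ",".join("?" for _ in values)


def related_tags_in_steps(conn, user_id, tag_names):
    names = sorted(set(tag_names))  # ignore duplicate names in the filter

    # Step 1: tag names -> tag ids.
    tag_ids = [row[0] for row in conn.execute(
        f"SELECT id FROM tags WHERE tag IN ({marks(names)})", names)]
    if len(tag_ids) < len(names):
        return []  # an unknown tag means no link can carry the whole set

    # Step 2: links of this user that carry every tag id (intersection).
    link_ids = [row[0] for row in conn.execute(
        f"""SELECT lt2.link_id
            FROM link_tags lt2, links l2
            WHERE l2.id = lt2.link_id
              AND l2.created_by = ?
              AND lt2.tag_id IN ({marks(tag_ids)})
            GROUP BY lt2.link_id
            HAVING COUNT(*) = ?""",
        [user_id, *tag_ids, len(tag_ids)])]
    if not link_ids:
        return []

    # Step 3: all tags on those links, with the number of links per tag.
    return conn.execute(
        f"""SELECT COUNT(*) AS link_count, t.tag AS tag
            FROM link_tags lt, tags t
            WHERE t.id = lt.tag_id
              AND lt.link_id IN ({marks(link_ids)})
            GROUP BY t.tag
            ORDER BY link_count DESC, tag ASC""",
        link_ids).fetchall()
```

Calling related_tags_in_steps(make_demo_db(), 1, ["mysql", "sql"]) returns (link_count, tag) rows for the links of user 1 that carry both tags. Each step only ever sees a literal list of ids, which is the property that made the separate queries fast.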
Do you have any idea what is going on? EXPLAIN shows roughly the same plans for the large query as for the separate ones combined. The difference is in the number of rows processed in each step (which is also strange).
Could you help me rewrite the original query, hint the MySQL optimizer to run it efficiently, or point me to the MySQL bug that causes this behavior?
EXPLAIN results for original query:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 32 | Using temporary; Using filesort
2 | DERIVED | lt | index | tag_id | link_tag_id | 8 | NULL | 78162 | Using where; Using index; Using temporary; Using filesort
2 | DERIVED | t | eq_ref | PRIMARY | PRIMARY | 4 | lstack_prod.lt.tag_id | 1 |
3 | DEPENDENT SUBQUERY | t2 | range | PRIMARY,tag | tag | 767 | NULL | 2 | Using where; Using temporary; Using filesort
3 | DEPENDENT SUBQUERY | lt2 | ref | link_tag_id,tag_id,link_id | tag_id | 4 | lstack_prod.t2.id | 7 |
3 | DEPENDENT SUBQUERY | l2 | eq_ref | PRIMARY,created_by | PRIMARY | 4 | lstack_prod.lt2.link_id | 1 | Using where
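The WHERE ... IN (SELECT values FROM table) construct is extremely inefficient in MySQL and tends to trigger full table scans and filesorts. MySQL 5.1 runs such a subquery as a dependent subquery, re-evaluating it for rows of the outer query, which is what the DEPENDENT SUBQUERY rows in your EXPLAIN output indicate. Generally, you should replace these with an INNER JOIN.
I think this should help, but I haven’t tried to re-create your DB and haven’t run this query, so there might be typos.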
However, an explain plan would be very helpful.
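One way a JOIN-based rewrite along these lines could look is sketched below. Treat it as a hedged illustration, not a verified fix. The SQL in JOIN_QUERY uses only the tables and columns from the question, with links.id and links.created_by inferred from the original query. The harness around it is an assumption: Python's sqlite3 with an in-memory database and invented sample rows, standing in for MySQL so the result can be checked on a small data set. The names make_demo_db and related_tags are hypothetical. It has not been run against MySQL 5.1, so the actual speed-up is untested.

```python
import sqlite3

# The link-matching subquery becomes a derived table in FROM that is joined
# to link_tags, replacing both IN (SELECT ...) subqueries with INNER JOINs.
# The CONCAT hash is dropped: UNIQUE (link_id, tag_id) already guarantees
# that each (tag, link) pair is counted once.
JOIN_QUERY = """
SELECT COUNT(*) AS link_count, t.tag AS tag
FROM (
    SELECT lt2.link_id
    FROM tags t2
    INNER JOIN link_tags lt2 ON lt2.tag_id = t2.id
    INNER JOIN links l2 ON l2.id = lt2.link_id
    WHERE l2.created_by = ?
      AND t2.tag IN ({tag_marks})
    GROUP BY lt2.link_id
    HAVING COUNT(*) = ?
) matched
INNER JOIN link_tags lt ON lt.link_id = matched.link_id
INNER JOIN tags t ON t.id = lt.tag_id
GROUP BY t.tag
ORDER BY link_count DESC, tag ASC
"""


def make_demo_db():
    """Tiny in-memory stand-in for the production schema (SQLite, not MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE links (id INTEGER PRIMARY KEY, created_by INTEGER NOT NULL);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, tag TEXT NOT NULL UNIQUE);
        CREATE TABLE link_tags (
            link_id INTEGER NOT NULL,
            tag_id  INTEGER NOT NULL,
            UNIQUE (link_id, tag_id)
        );
        -- Invented sample data: links 1-3 belong to user 1, link 4 to user 2.
        INSERT INTO links VALUES (1, 1), (2, 1), (3, 1), (4, 2);
        INSERT INTO tags VALUES (1, 'mysql'), (2, 'php'), (3, 'sql'), (4, 'web');
        INSERT INTO link_tags VALUES
            (1, 1), (1, 2), (1, 3),
            (2, 1), (2, 3), (2, 4),
            (3, 1),
            (4, 1), (4, 3), (4, 4);
    """)
    return conn


def related_tags(conn, user_id, tag_names):
    """Run the JOIN-based query for one user and one set of filter tags."""
    names = sorted(set(tag_names))  # ignore duplicate names in the filter
    sql = JOIN_QUERY.format(tag_marks=",".join("?" for _ in names))
    # Parameter order follows the placeholders: user, tag names, tag count.
    return conn.execute(sql, [user_id, *names, len(names)]).fetchall()
```

With the sample data, related_tags(make_demo_db(), 1, ["mysql", "sql"]) returns one (link_count, tag) row per tag found on the matching links. When trying the query on MySQL, running EXPLAIN on it and checking that no DEPENDENT SUBQUERY rows remain would be a reasonable first check.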