I’m dealing with a database where items are “tagged” a certain number of times.
item (100k rows)
- id
- name
- other stuff
tag (10k rows)
- id
- name
item2tag (1,000,000 rows)
- item_id
- tag_id
- count
I’m looking for the fastest solution to:
How do I select the items that have been tagged with all of X, Y, and Z (where X, Y, and Z may be given as tag names)?
Here’s what I have so far… I’d just like to make sure I’m doing it in the best way possible:
First get the tag_ids from the names:
SELECT tag.id FROM tag WHERE name IN ('X', 'Y', 'Z');
Then I group by those tag_ids and use Having to filter the result:
SELECT item_id, COUNT(tag_id)
FROM item2tag
WHERE tag_id IN (1, 2, 3)
GROUP BY item_id
HAVING COUNT(tag_id) = 3;
Then I can just select from item with those ids.
SELECT * FROM item WHERE id IN ([results from prior query])
I have millions of rows in item2tag, with an index on (item_id, tag_id). Is this going to be the fastest solution?
The method you have suggested is probably the most common way to perform the query but might not be the fastest. Using joins can be faster:
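As a minimal sketch of what a join-based version could look like (not necessarily the exact query that was tested), assuming the tag ids 1, 2, and 3 from the question and at most one item2tag row per (item_id, tag_id) pair:

```sql
-- Join item2tag to item once per required tag.
-- Each join keeps only the items that carry that particular tag,
-- so a row survives all three joins only if the item has all three tags.
-- Assumption: (item_id, tag_id) is unique in item2tag; otherwise
-- duplicate rows would be returned for the same item.
SELECT item.*
FROM item
JOIN item2tag AS t1 ON t1.item_id = item.id AND t1.tag_id = 1
JOIN item2tag AS t2 ON t2.item_id = item.id AND t2.tag_id = 2
JOIN item2tag AS t3 ON t3.item_id = item.id AND t3.tag_id = 3;
```

Unlike the GROUP BY / HAVING form, this needs one extra join per tag, so the query text has to be generated to match the number of tags being searched for.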
You should ensure that you have the following indexes:
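One plausible set of indexes is sketched below. This is an assumption, not necessarily the set used in the test, and the index names are hypothetical:

```sql
-- Assumed: item.id and tag.id are already primary keys.

-- The index mentioned in the question: finds a given tag for a known item.
CREATE INDEX ix_item2tag_item_tag ON item2tag (item_id, tag_id);

-- The reverse order: finds all items for a known tag without scanning the table.
CREATE INDEX ix_item2tag_tag_item ON item2tag (tag_id, item_id);

-- Speeds up translating tag names into tag ids.
CREATE INDEX ix_tag_name ON tag (name);
```

Comparing the output of EXPLAIN for each query before and after adding an index is a reasonable way to check whether the optimizer actually uses it.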
I performance tested this query against the original in a few different scenarios.
The SQL I used for the performance test is pasted below. You can run the test yourself, or modify it slightly to test other queries or different scenarios.
Warning: Don’t run this script on your production database, as it modifies the contents of the item2tag table. Running the script can take a few minutes because it creates a lot of data.