Here is a simplified version of my data:
products:
+----+-----------+
| id | name |
+----+-----------+
| 1 | Product X |
| 2 | Product Y |
| 3 | Product Z |
+----+-----------+
categories:
+----+---------------+
| id | name |
+----+---------------+
| 1 | Hotel |
| 2 | Accommodation |
+----+---------------+
category_product
+----+------------+-------------+
| id | product_id | category_id |
+----+------------+-------------+
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 2 | 1 |
| 4 | 3 | 2 |
+----+------------+-------------+
How do I construct an efficient query that will only retrieve products that have both categories “Hotel” and “Accommodation” related (eg. Product X)?
I first tried a join approach
SELECT *
FROM products p
JOIN category_product cp
ON p.id = cp.product_id
WHERE cp.category_id = 1 OR cp.category_id = 2
^ This doesn’t work because it doesn’t contrain the query to containing both.
I have found an approach using sub-queries that works… but I’ve been warned against sub-queries for performance reasons:
SELECT *
FROM products p
WHERE
(
SELECT id
FROM category_product
WHERE product_id = p.id
AND category_id = 1
)
AND
(
SELECT id
FROM category_product
WHERE product_id = p.id
AND category_id = 2
)
Are there any better solutions (or how about alternatives)? I’ve considered de-normalizing categories to an extra column on products but would ideally like to avoid that. Hoping for a magic bullet solution!
UPDATE
I’ve run some of the (great) solutions provided in the answers:
My data is 235 000 category_product rows and 58 000 products and obviously benchmarks are always dependent on environment and indexes etc.
“Relational division” @podiluska
2 categories: 2826 rows ~ 20ms
5 categories: 46 rows ~ 25-30 ms
8 categories: 1 rows ~ 25-30 ms
“Where exists” @Tim Schmelter
2 categories: 2826 rows ~ 5-7ms
5 categories: 46 rows ~ 30 ms
8 categories: 1 rows ~ 300 ms
One can see the results start to diverge with having a greater number of categories thrown in. I’ll look at using “relational division” as it provides consistent results but implementation might cause me to look at “where exists” too (long format http://pastebin.com/6NRX0QbJ)
This technique is called “relational division”