I’m having a hard time figuring out how to query/index a database.
The situation is pretty simple. Each time a user visits a category, his/her visit date is stored. My goal is to list the categories in which elements have been added after the user’s latest visit.
Here are the two tables:
CREATE TABLE `elements` (
  `category_id` int(11) NOT NULL,
  `element_id` int(11) NOT NULL,
  `title` varchar(255) NOT NULL,
  `added_date` datetime NOT NULL,
  PRIMARY KEY (`category_id`,`element_id`),
  KEY `index_element_id` (`element_id`)
);
CREATE TABLE `categories_views` (
  `member_id` int(11) NOT NULL,
  `category_id` int(11) NOT NULL,
  `view_date` datetime NOT NULL,
  PRIMARY KEY (`member_id`,`category_id`),
  KEY `index_element_id` (`category_id`)
);
Query:
SELECT
  categories_views.*,
  elements.category_id
FROM
  elements
  INNER JOIN categories_views ON (categories_views.category_id = elements.category_id)
WHERE
  categories_views.member_id = 1
  AND elements.added_date > categories_views.view_date
GROUP BY elements.category_id
Explained:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: elements
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 89057
Extra: Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: categories_views
type: eq_ref
possible_keys: PRIMARY,index_element_id
key: PRIMARY
key_len: 8
ref: const,convert.elements.category_id
rows: 1
Extra: Using where
With about 100k rows in each table, the query is taking around 0.3s, which is too long for something that should be executed for every user action in a web context.
If possible, what indexes should I add, or how should I rewrite this query in order to avoid using filesorts and temporary tables?
If each member has a relatively low number of categories_views rows, I suggest testing a different query:
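One query of that shape, as a sketch only (the aliases v and e are illustrative, and it assumes (member_id, category_id) is unique in categories_views), drives from the member's rows and probes elements with a correlated EXISTS:

```sql
-- Start from the (few) rows for one member, then check each category
-- for at least one element added after that member's last view.
SELECT v.*
FROM categories_views v
WHERE v.member_id = 1
  AND EXISTS (
        SELECT 1
        FROM elements e
        WHERE e.category_id = v.category_id
          AND e.added_date > v.view_date
      );
```

Because this returns at most one row per categories_views row, no GROUP BY is needed, which should remove the reason for the temporary table and filesort.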
For optimum performance of that query, you’d want to ensure you had indexes:
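A minimal sketch of such indexes, assuming the rewritten query filters categories_views on member_id and probes elements by category_id and added_date; the index name elements_ix1 is hypothetical:

```sql
-- Lets the "is there any element newer than view_date in this category?"
-- check be answered from the index alone.
CREATE INDEX elements_ix1 ON elements (category_id, added_date);

-- categories_views needs an index with member_id as the leading column;
-- the existing PRIMARY KEY (member_id, category_id) already provides one.
```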
NOTE: It looks like the primary key on the categories_views table may be (member_id, category_id), which means an appropriate index already exists.
I’m assuming (as best as I can figure out from the original query) that the categories_views table contains only the “latest” view of the category for a user, that is, (member_id, category_id) is unique. It looks like that has to be the case if the original query is returning a correct result set, that is, if it’s only returning categories that have “new” elements added since the “last view” of that category by the user. Otherwise, the existence of any “older” view_date values in the categories_views table would trigger the inclusion of the category, even if there were a newer view_date that was later than the latest (max added_date) element in a category.
If that’s not the case, i.e. (member_id, category_id) is not unique, then the query would need to be changed.
The query in the original question is a bit puzzling: it references element_views as a table name or table alias, but that doesn’t appear in the EXPLAIN output. I’m going under the assumption that element_views is meant to be a synonym for categories_views.
For the original query, add a covering index on the elements table. The goal there is to get the EXPLAIN output to show “Using index”.
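One way to check this, as a sketch: the column choice (category_id, added_date) is an assumption based on the columns the original query reads from elements, and the index name is hypothetical.

```sql
-- Skip this statement if an equivalent index already exists.
ALTER TABLE elements
  ADD KEY elements_category_added (category_id, added_date);

-- Re-run EXPLAIN on the original query and inspect the Extra column.
EXPLAIN
SELECT categories_views.*, elements.category_id
FROM elements
INNER JOIN categories_views
        ON categories_views.category_id = elements.category_id
WHERE categories_views.member_id = 1
  AND elements.added_date > categories_views.view_date
GROUP BY elements.category_id;
```

If the optimizer uses the new index as a covering index, the Extra column for the elements row should include “Using index”.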
You might also try adding an index:
To get all the columns from the categories_views table (for the select list), the query is going to have to visit the pages in the table (unless there’s an index that contains all of those columns). The goal would be to reduce the number of rows that need to be visited on data pages to find the row, by having all (or most) of the predicates satisfied from the index.
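A sketch of such an index, assuming the three columns shown in the table definition are the only ones the query needs from categories_views; the index name is hypothetical:

```sql
-- Contains every column of categories_views, so the select list and the
-- view_date comparison can both be satisfied from the index.
CREATE INDEX categories_views_ix1
    ON categories_views (member_id, category_id, view_date);
```

Whether this helps depends on the storage engine. With InnoDB the primary key is the clustered index and already holds the full row, so the extra index may add little; comparing EXPLAIN output before and after is the safest check.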
Is it necessary to return the category_id column from the elements table? Don’t we already know that this is the same value as in the category_id column from the categories_views table, due to the inner join predicate?