I’ve found myself in a bit of a predicament. I have a table used for page hit tracking with nearly 105 million rows (!). It looks like this:
CREATE TABLE `media_hits` (
`id` int(10) unsigned NOT NULL auto_increment,
`media_code` char(7) NOT NULL,
`day` date NOT NULL,
`hits` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`id`),
UNIQUE KEY `media_code` (`media_code`,`day`)
) ENGINE=InnoDB;
As you can imagine, running any kind of query on this table takes a long time. A typical query looks like this:
SELECT DISTINCT(`media_code`), COUNT(*) AS c
FROM `media_hits`
WHERE `day` >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY(`media_code`)
ORDER BY c DESC
LIMIT 200;
This query takes forever, and EXPLAIN on it gives me this:
id: 1
select_type: SIMPLE
table: media_hits
type: index
possible_keys: NULL
key: media_code
key_len: 10
ref: NULL
rows: 104773158
Extra: Using where; Using index; Using temporary; Using filesort
That’s just plain awful. So my question is: What can I do about this? Trying to add proper indexes now is impossible. The ALTER TABLE query would probably take over a week to run. I tried deleting rows older than 6 months, but 24 hours later that query was still running.
I need to fix this somehow. The only thing that crosses my mind is creating a new table with proper indexes and starting to record hits in that table. In the background I could have a script slowly inserting records from the old media_hits table. Can anyone offer suggestions on how to index this table, and possibly some hints on which columns I should index?
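A rough sketch of that plan, assuming MySQL. The table name media_hits_new, the extra index leading with `day`, and the id range of 100000 are hypothetical and not part of the existing schema:

```sql
-- Hypothetical replacement table: same columns as media_hits,
-- plus an index that leads with `day` so date-range filters can use it.
CREATE TABLE `media_hits_new` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `media_code` char(7) NOT NULL,
  `day` date NOT NULL,
  `hits` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `media_code` (`media_code`,`day`),
  KEY `day` (`day`,`media_code`)
) ENGINE=InnoDB;

-- Background copy, one id range per statement; the script would
-- repeat this with the next range (100001-200000, and so on).
-- If a row for the same code and day already exists in the new table,
-- the old hits are added to it instead of being dropped.
INSERT INTO `media_hits_new` (`media_code`, `day`, `hits`)
SELECT `media_code`, `day`, `hits`
FROM `media_hits`
WHERE `id` BETWEEN 1 AND 100000
ON DUPLICATE KEY UPDATE `hits` = `hits` + VALUES(`hits`);
```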
For this kind of job, indexing alone will most probably not help you much. It is better to think about a caching strategy, with some additional tables storing the aggregates you need.
For example, for your query above, you might add a second table “media_code_per_day” containing three columns: “media_code”, “counter” and “date”. Every time you insert a row into your original table, also update “media_code_per_day” accordingly. Then you can run a new query on “media_code_per_day” instead of your original query.
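A minimal sketch of that setup, assuming MySQL. Only the table and column names come from the suggestion above; the column types, the primary key order, the sample code 'abc1234', and the use of SUM (which assumes total hits per code is what you want) are my assumptions:

```sql
-- Hypothetical summary table: one row per media code per day.
CREATE TABLE `media_code_per_day` (
  `date` date NOT NULL,
  `media_code` char(7) NOT NULL,
  `counter` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (`date`, `media_code`)
) ENGINE=InnoDB;

-- Run alongside every hit recorded in the original table:
-- create today's row for this code, or bump its counter if it exists.
INSERT INTO `media_code_per_day` (`date`, `media_code`, `counter`)
VALUES (CURDATE(), 'abc1234', 1)
ON DUPLICATE KEY UPDATE `counter` = `counter` + 1;

-- The report, answered from the summary table instead.
SELECT `media_code`, SUM(`counter`) AS c
FROM `media_code_per_day`
WHERE `date` >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY `media_code`
ORDER BY c DESC
LIMIT 200;
```

Leading the primary key with `date` is a design choice here, so that the range filter on the day can use the index rather than scanning every row.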
Of course, to initialize the new table in your situation, you will have to make a batch run through all your existing rows, but that is only needed once.
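One way to keep that batch run from becoming a single huge statement is to do it in slices. This is only a sketch: it assumes the summary table has a unique key on (`date`, `media_code`), and the slice size of 100000 ids is arbitrary:

```sql
-- One slice of the backfill; repeat with the next id range until
-- the whole of media_hits has been covered.
-- Stop below the first id written after the per-hit updates of the
-- summary table began, otherwise those hits are counted twice.
INSERT INTO `media_code_per_day` (`date`, `media_code`, `counter`)
SELECT `day`, `media_code`, SUM(`hits`)
FROM `media_hits`
WHERE `id` BETWEEN 1 AND 100000
GROUP BY `day`, `media_code`
ON DUPLICATE KEY UPDATE `counter` = `counter` + VALUES(`counter`);
```

Filtering on `id` keeps each slice on the primary key of media_hits, so every statement touches a bounded number of rows.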