I have a table:
CREATE TABLE `p` (
`id` bigint(20) unsigned NOT NULL,
`rtime` datetime NOT NULL,
`d` int(10) NOT NULL,
`n` int(10) NOT NULL,
PRIMARY KEY (`rtime`,`id`,`d`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
And I have a query:
select id, d, sum(n) from p where rtime between '2012-08-25' and date(now()) group by id, d;
I'm running EXPLAIN on this query on a tiny table (2 records) and it tells me it's going to use my PK:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | p | range | PRIMARY | PRIMARY | 8 | NULL | 1 | Using where; Using temporary; Using filesort
But when I run the same query on the same table, only this time with a huge number of rows (350 million records), it prefers to go through all the records and ignores my keys:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | p | ALL | PRIMARY | NULL | NULL | NULL | 355465280 | Using where; Using temporary; Using filesort
Obviously, this is extremely slow.
Can anyone help?
EDIT: this simple query is also taking a significant amount of time:
select count(*) from propagation_delay where rtime > '2012-08-28';
Your query filters on rtime and groups by id and d. At a minimum you ought to index by rtime. You might also want to try indexing by rtime, id, d, n, in this order, but if you do, you will see that the index contains more or less the same data as the table.
Probably the optimizer does some calculations and comes to the conclusion that it is not really worthwhile to use the index.
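For reference, the wider index mentioned above could be declared roughly as follows. This is only a sketch: the index name idx_rtime_id_d_n is made up for illustration, and building it on a 350M-row MyISAM table will take considerable time and disk space, so treat it as an experiment rather than a recommendation.

```sql
-- Hypothetical index name; the column order follows the text above.
-- Every column the query touches is in the index, so MySQL may be able
-- to answer from the index alone (look for "Using index" in EXPLAIN).
ALTER TABLE p ADD INDEX idx_rtime_id_d_n (rtime, id, d, n);

-- Re-check the plan afterwards.
EXPLAIN
SELECT id, d, SUM(n)
FROM p
WHERE rtime BETWEEN '2012-08-25' AND DATE(NOW())
GROUP BY id, d;
```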
I'd leave an index on rtime alone. The real clincher is how many records match the WHERE: if they are just a few, it is convenient to read the index and hop around the table. If they are many, it may be better to scan the whole table sequentially, saving on the to-and-fro reads.
Okay, then it is likely that the cumulative cost of quickly extracting some six million records from the index, and then shuttling to and fro from the main table to fetch those six million records, is more than the cost of opening the main table and trawling through all 350M records, grouping and summing along the way.
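One way to check this reasoning on your own data, sketched here under the assumption that you can afford to run the query twice: count how many rows the date range actually matches, then force the range scan with an index hint and time it against the unhinted query. FORCE INDEX is standard MySQL syntax; whether it wins here depends on your data and disks.

```sql
-- How selective is the date range? Compare the result with the ~355M total rows.
-- Note: this count may itself be slow if it ends up scanning the table.
SELECT COUNT(*)
FROM p
WHERE rtime BETWEEN '2012-08-25' AND DATE(NOW());

-- Force the range scan on the primary key, then time it against the plain query.
SELECT id, d, SUM(n)
FROM p FORCE INDEX (PRIMARY)
WHERE rtime BETWEEN '2012-08-25' AND DATE(NOW())
GROUP BY id, d;
```

If the hinted version is not clearly faster, the optimizer's choice of a full scan was probably reasonable for that date range.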
In such a scenario, if you always (or mostly) run aggregate queries on rtime, AND the table is an accumulating (historical) table, AND each (id, d) pair sees dozens of entries per day, you might consider creating a secondary table aggregated by date. I.e., at (say) midnight, you run a query that sums n per (id, d) pair for the day just ended and stores the result in aggregate_table.
The data in aggregate_table has only one entry per (id, d) pair per day, holding the sum of n for that day; the table is proportionately smaller, and queries are faster. This assumes that you have a comparatively small number of (id, d) pairs and that each of them generates lots of rows in the main table each day.
With one log entry per minute per pair (1,440 rows per day collapsing into one), aggregation should speed things up by more than three orders of magnitude. Conversely, if you have the twice-daily readings of a huge number of different sensors, the benefits will be negligible.
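A minimal sketch of what such a nightly roll-up could look like. Only the name aggregate_table comes from the text above; the column rdate, the key layout, and the exact date boundaries are assumptions you would adapt to your own schema and schedule.

```sql
-- Assumed layout: one row per day per (id, d) pair.
CREATE TABLE aggregate_table (
  rdate DATE NOT NULL,              -- hypothetical column: the day being summarised
  id    BIGINT(20) UNSIGNED NOT NULL,
  d     INT(10) NOT NULL,
  n     BIGINT NOT NULL,            -- daily sum of p.n
  PRIMARY KEY (rdate, id, d)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- Run shortly after midnight: roll up yesterday's rows from the main table.
INSERT INTO aggregate_table (rdate, id, d, n)
SELECT DATE(rtime), id, d, SUM(n)
FROM p
WHERE rtime >= CURDATE() - INTERVAL 1 DAY
  AND rtime <  CURDATE()
GROUP BY DATE(rtime), id, d;

-- The report then reads the much smaller table.
SELECT id, d, SUM(n)
FROM aggregate_table
WHERE rdate >= '2012-08-25' AND rdate < CURDATE()
GROUP BY id, d;
```

The last query covers complete days only. The original BETWEEN ... AND DATE(NOW()) also includes rows stamped exactly at today's midnight, so add today's partial data from p if you need an exact match.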