I have a system that saves statistics for servers in a network. Later, a user can consume all the data to plan their growth, so it is important to summarize the data for graphing, i.e. across an hour, day, week, year, etc.
I’m trying to do something like this:
select created_time / 60, count(*)
from pm_server_stat
group by (created_time / 60);
--with this index
CREATE INDEX pm_server_stat_created_time_60
ON pm_server_stat
USING btree
((created_time / 60));
This is the EXPLAIN output I get:
"GroupAggregate (cost=189822.36..213951.06 rows=1206435 width=8)"
" Output: ((created_time / 60)), count(*)"
" -> Sort (cost=189822.36..192838.45 rows=1206435 width=8)"
" Output: created_time, ((created_time / 60))"
" Sort Key: ((pm_server_stat.created_time / 60))"
" -> Seq Scan on public.pm_server_stat (cost=0.00..34967.44 rows=1206435 width=8)"
" Output: created_time, (created_time / 60)"
Does anyone know why the index isn’t used and I get a sequential scan plus a sort instead? I suspect that the types might be different?
PostgreSQL doesn’t have “covering” indexes (index-only scans) in 9.1 or earlier. That means it has to visit the table rows anyway, and since your query reads every row, it might as well scan the table sequentially. Index-only scans are due to appear in 9.2 (currently in beta testing, if you want to try it out), but I’m not sure they’d be smart enough for this anyway.
An index-only approach will never work anyway once you want “total files served” or “total packets transmitted”, since those need columns that aren’t in the index.
Typically, for this sort of summarizing task you’d have one or more summary tables: stats_minute, stats_hour, stats_day, stats_week etc. How many you need depends on total data size and performance requirements. Keep the summaries up to date with a simple cron job. If data can arrive with “late” timestamps, you might need to run the job with a slight lag or allow for recalculation.
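As a rough sketch of what such a roll-up might look like: the column layout of stats_hour below is made up for illustration, and I’m assuming created_time is an integer holding Unix epoch seconds (the original query only shows that it is divisible by 60). Adjust to your real schema.

```sql
-- Hypothetical hourly summary table; the column names are assumptions.
CREATE TABLE IF NOT EXISTS stats_hour (
    hour_start bigint PRIMARY KEY,  -- epoch seconds at the start of the hour
    row_count  bigint NOT NULL      -- number of pm_server_stat rows in that hour
);

-- Roll up every completed hour that has not been summarized yet.
-- Intended to be run periodically, e.g. from a cron job via psql.
INSERT INTO stats_hour (hour_start, row_count)
SELECT (created_time / 3600) * 3600 AS hour_start,  -- integer division truncates to the hour
       count(*)
FROM pm_server_stat
WHERE created_time >= (SELECT coalesce(max(hour_start) + 3600, 0) FROM stats_hour)
  -- stop at the start of the current hour so only complete hours are stored
  AND created_time < (floor(extract(epoch FROM now()))::bigint / 3600) * 3600
GROUP BY 1;
```

This version never revisits an hour once it has been written, so rows that arrive late would be missed. If that can happen, you could run it a few minutes past the hour or delete and rebuild the most recent hours on each run.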
Then you can just query a union of the summary table with a live aggregate of all the rows since the start of the current hour. That’s much less data to query and can be as fast as you might need.
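Something along these lines, again only a sketch: it assumes a hypothetical stats_hour(hour_start, row_count) summary table that holds all completed hours, and that created_time is an integer in Unix epoch seconds.

```sql
-- Hypothetical summary table (created here only so the query runs on its own).
CREATE TABLE IF NOT EXISTS stats_hour (
    hour_start bigint PRIMARY KEY,  -- epoch seconds at the start of the hour
    row_count  bigint NOT NULL
);

-- Completed hours come from the summary table ...
SELECT hour_start, row_count
FROM stats_hour
UNION ALL
-- ... and the current, still incomplete hour is counted from the raw rows.
SELECT (created_time / 3600) * 3600 AS hour_start,
       count(*) AS row_count
FROM pm_server_stat
WHERE created_time >= (floor(extract(epoch FROM now()))::bigint / 3600) * 3600
GROUP BY 1
ORDER BY hour_start;
```

The second branch only touches rows from the current hour, so a plain index on created_time should be enough to keep it cheap. If the cron job can lag behind, the WHERE clause would need to start from the last summarized hour rather than from the current one.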