I have a PostgreSQL database with a lot of rows (~160,000,000). It is basically a set of network logs. There is a column called time, which is a timestamp. For every unique time there is one log entry for each machine on the network, so multiple rows share the same timestamp, e.g.:
time ip value
2:00 192.168.1.1 5
2:00 192.168.1.2 4
2:00 192.168.1.3 5
3:00 192.168.1.1 3
3:00 192.168.1.2 5
3:00 192.168.1.3 2
etc etc
I have created an index on the time column, as this is the field I always query against, e.g.:
select * from networklogs where time = '2:00' and value = 5
Without the index, every query takes approximately 2 minutes because it has to scan the whole table. I created the index so that the query would first find the subset of rows with a matching time and then filter on the other value, and I hoped this would make the query a lot faster.
However the queries seem to take longer now…
This is strange because the pgAdmin explain output shows that the query will use the time index, that the index narrows the data from 160,000,000 rows to ~800,000, and that the filter then narrows these down to 1,700 rows. The explain itself takes 17 ms, but running the query takes 3 minutes.
This should be faster than searching for two matching criteria on 160,000,000 items!
I cannot work out why it isn't faster. Even though pgAdmin explains the query as using the time index, does it actually use it when the query is executed?
Does anyone have any ideas or suggestions?
Thanks,
James
UPDATE: I have run an EXPLAIN ANALYZE on the following command:
explain analyze select latitude,longitude from networklogs where value = 5 and time = '2:00'
The results are:
"Bitmap Heap Scan on networklogs (cost=13178.17..1488620.94 rows=848 width=19) (actual time=3507.923..278760.509 rows=33 loops=1)"
" Recheck Cond: (time = '2:00'::timestamp without time zone)"
" Rows Removed by Index Recheck: 38302021"
" Filter: (value = 5)"
" Rows Removed by Filter: 882873"
" -> Bitmap Index Scan on timeindex (cost=0.00..13177.95 rows=530111 width=0) (actual time=695.771..695.771 rows=882906 loops=1)"
" Index Cond: (time = '2:00'::timestamp without time zone)"
"Total runtime: 278764.148 ms"
If these are the columns you will mostly search on, I would advise creating a multicolumn index on them. In your current situation, the database has an index on time only, so it can tell which rows match that condition, but it then has to fetch those rows from all the different places the index points to and check the second condition against each one. In your plan, the index scan returns 882,906 rows, of which only 33 survive the filter on value.
With a multicolumn index, the database does not have to fetch rows just to check the other condition; it fetches only the rows it already knows match both search criteria. The order of the columns in the index matters, of course.
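As a rough sketch of what such a multicolumn index could look like, assuming the table and column names from the question (the index name networklogs_time_value_idx is made up for illustration, and the literals are the question's placeholder values):

```sql
-- Multicolumn index covering both search conditions.
-- "time" comes first so the index stays usable for queries
-- that filter on time alone; the index name is hypothetical.
CREATE INDEX networklogs_time_value_idx
    ON networklogs (time, value);

-- Refresh planner statistics after building the index.
ANALYZE networklogs;

-- Re-run the query from the question and compare the plan
-- with the original Bitmap Heap Scan on timeindex.
EXPLAIN ANALYZE
SELECT latitude, longitude
FROM networklogs
WHERE time = '2:00'
  AND value = 5;
```

Since both conditions are equality tests, either column order would serve this particular query. If the new plan no longer shows a separate Filter step on value, the index is handling both conditions.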
I see you are using latitude and longitude data. Using the point type might be a good fit for that, since you would be able to use all the geometric operators that come with Postgres by default. This kind of data can be indexed as well.
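A minimal sketch of the point-type idea, assuming latitude and longitude are stored in a form that can be cast to double precision (their actual types are not shown in the question). The column name location, the index name, and the bounding-box coordinates are hypothetical:

```sql
-- Add a point column and fill it from the existing columns.
-- Points are (x, y), so longitude goes first here.
ALTER TABLE networklogs ADD COLUMN location point;

UPDATE networklogs
SET location = point(longitude::double precision,
                     latitude::double precision);

-- Point columns can be indexed with GiST.
CREATE INDEX networklogs_location_idx
    ON networklogs USING gist (location);

-- Example geometric query: rows whose location falls inside
-- a bounding box (coordinates are made-up illustration values).
SELECT time, ip, value
FROM networklogs
WHERE location <@ box(point(-1.0, 50.0), point(1.0, 52.0));
```

On a table of this size the UPDATE rewrites every row, so it is worth trying on a copy or a subset first.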