I have built a small inventory system using postgresql and psycopg2. Everything works great, except, when I want to create aggregated summaries/reports of the content, I get really bad performance due to count()’ing and sorting.
The DB schema is as follows:
CREATE TABLE hosts
(
id SERIAL PRIMARY KEY,
name VARCHAR(255)
);
CREATE TABLE items
(
id SERIAL PRIMARY KEY,
description TEXT
);
CREATE TABLE host_item
(
id SERIAL PRIMARY KEY,
host INTEGER REFERENCES hosts(id) ON DELETE CASCADE ON UPDATE CASCADE,
item INTEGER REFERENCES items(id) ON DELETE CASCADE ON UPDATE CASCADE
);
There are some other fields as well, but those are not relevant.
I want to extract 2 different reports:
– List of all hosts with the number of items per, ordered from highest
to lowest count
– List of all items with the number of hosts per, ordered from highest to lowest count
I have used 2 queries for the purpose:
Items with host count:
SELECT i.id, i.description, COUNT(hi.id) AS count FROM items AS i LEFT JOIN host_item AS hi ON (i.id=hi.item) GROUP BY i.id ORDER BY count DESC LIMIT 10;
Hosts with item count:
SELECT h.id, h.name, COUNT(hi.id) AS count FROM hosts AS h LEFT JOIN host_item AS hi ON (h.id=hi.host) GROUP BY h.id ORDER BY count DESC LIMIT 10;
Problem is: the queries runs for 5-6 seconds before returning any data. As this is a web based application, 6 seconds are just not acceptable. The database is heavily populated with approximately 50k hosts, 1000 items and 400 000 host/items relations, and will likely increase significantly when (or perhaps if) the application will be used.
After playing around, I found that by removing the “ORDER BY count DESC” part, both queries would execute instantly without any delay whatsoever (less than 20ms to finish the queries).
Is there any way I can optimize these queries so that I can get the result sorted without the delay? I was trying different indexes, but seeing as the count is computed it is possible to utilize an index for this. I have read that count()’ing in postgresql is slow, but its the sorting that are causing me problems…
My current workaround is to run the queries above as an hourly job, putting the result into a new table with an index on the count column for quick lookup.
I use Postgresql 9.2.
Update: Query plan as ordered 🙂
EXPLAIN ANALYZE
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
ORDER BY count DESC
LIMIT 10;
Limit (cost=699028.97..699028.99 rows=10 width=21) (actual time=5427.422..5427.424 rows=10 loops=1)
-> Sort (cost=699028.97..699166.44 rows=54990 width=21) (actual time=5427.415..5427.416 rows=10 loops=1)
Sort Key: (count(hi.id))
Sort Method: top-N heapsort Memory: 25kB
-> GroupAggregate (cost=613177.95..697840.66 rows=54990 width=21) (actual time=3317.320..5416.440 rows=54990 loops=1)
-> Merge Left Join (cost=613177.95..679024.94 rows=3653163 width=21) (actual time=3317.267..5025.999 rows=3653163 loops=1)
Merge Cond: (h.id = hi.host)
-> Index Scan using hosts_pkey on hosts h (cost=0.00..1779.16 rows=54990 width=17) (actual time=0.012..15.693 rows=54990 loops=1)
-> Materialize (cost=613177.95..631443.77 rows=3653163 width=8) (actual time=3317.245..4370.865 rows=3653163 loops=1)
-> Sort (cost=613177.95..622310.86 rows=3653163 width=8) (actual time=3317.199..3975.417 rows=3653163 loops=1)
Sort Key: hi.host
Sort Method: external merge Disk: 64288kB
-> Seq Scan on host_item hi (cost=0.00..65124.63 rows=3653163 width=8) (actual time=0.006..643.257 rows=3653163 loops=1)
Total runtime: 5438.248 ms
EXPLAIN ANALYZE
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
LIMIT 10;
Limit (cost=0.00..417.03 rows=10 width=21) (actual time=0.136..0.849 rows=10 loops=1)
-> GroupAggregate (cost=0.00..2293261.13 rows=54990 width=21) (actual time=0.134..0.845 rows=10 loops=1)
-> Merge Left Join (cost=0.00..2274445.41 rows=3653163 width=21) (actual time=0.040..0.704 rows=581 loops=1)
Merge Cond: (h.id = hi.host)
-> Index Scan using hosts_pkey on hosts h (cost=0.00..1779.16 rows=54990 width=17) (actual time=0.015..0.021 rows=11 loops=1)
-> Index Scan Backward using idx_host_item_host on host_item hi (cost=0.00..2226864.24 rows=3653163 width=8) (actual time=0.005..0.438 rows=581 loops=1)
Total runtime: 1.143 ms
Update: All the answers to this question is really good for learning and understanding how Postgres works. There does not seem to be any definite solution to this problem, but I really appreciate all the excellent answers you have provided, and I will use those in my future work with Postgresql. Thanks alot guys!
@Gordon and @willglynn have provided a lot of useful background as to why your query is slow.
A workaround would be to add a counter to the tables
itemsandhostsand triggers that keep them up to date – for a non-trivial cost to write operations.Or use materialized views like you do. I might opt for that.
For that, you still need to execute these queries on a regular basis and they can be improved. Rewrite your first one to:
If there is a row in table
itemsfor most rows in tablehost_item, it is faster to aggregate first and thenJOIN. Contrary to what @willglynn speculates, this is not optimized automatically in Postgres 9.1.count(*)is faster thancount(col)on principal – and equivalent whilecolcannot be NULL. (ALEFT JOINmight introduce NULL values.)Simplified
LEFT JOINtoJOIN. It should be safe to assume that there are always at least ten distinct hosts. Doesn’t matter much for your original query, but it’s a requirement for this one.Indexes on table
host_itemwon’t help, and the PK onitemscovers the rest.Probably still not good enough for your case, but in my tests with Postgres 9.1 this form is more than twice as fast. Should translate to 9.2, but test with
EXPLAIN ANALYZEto be sure.