I have built a small inventory system using postgresql and psycopg2. Everything works great,


Asked: June 13, 2026

I have built a small inventory system using PostgreSQL and psycopg2. Everything works great, except that when I want to create aggregated summaries/reports of the content, I get really bad performance due to count()’ing and sorting.

The DB schema is as follows:

CREATE TABLE hosts
(
        id SERIAL PRIMARY KEY,
        name VARCHAR(255)
);
CREATE TABLE items
(
        id SERIAL PRIMARY KEY,
        description TEXT
);
CREATE TABLE host_item
(
        id SERIAL PRIMARY KEY,
        host INTEGER REFERENCES hosts(id) ON DELETE CASCADE ON UPDATE CASCADE,
        item INTEGER REFERENCES items(id) ON DELETE CASCADE ON UPDATE CASCADE
);

There are some other fields as well, but those are not relevant.

I want to extract 2 different reports:
– List of all hosts with the number of items per host, ordered from highest to lowest count
– List of all items with the number of hosts per item, ordered from highest to lowest count

I have used 2 queries for the purpose:

Items with host count:

SELECT i.id, i.description, COUNT(hi.id) AS count
FROM items AS i
LEFT JOIN host_item AS hi
ON (i.id=hi.item)
GROUP BY i.id
ORDER BY count DESC
LIMIT 10;

Hosts with item count:

SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
ORDER BY count DESC
LIMIT 10;

The problem is that the queries run for 5-6 seconds before returning any data. As this is a web-based application, 6 seconds is just not acceptable. The database is heavily populated with approximately 50k hosts, 1000 items and 400 000 host/item relations, and will likely grow significantly when (or perhaps if) the application is put to use.

After playing around, I found that by removing the “ORDER BY count DESC” part, both queries would execute instantly without any delay whatsoever (less than 20ms to finish the queries).

Is there any way I can optimize these queries so that I get the result sorted without the delay? I tried different indexes, but since the count is computed, it does not seem possible to use an index for this. I have read that count()’ing in PostgreSQL is slow, but it’s the sorting that is causing me problems…

My current workaround is to run the queries above as an hourly job, putting the result into a new table with an index on the count column for quick lookup.
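
A minimal sketch of this kind of periodic refresh, for illustration only. It uses Python's built-in sqlite3 with an in-memory database as a stand-in so that it runs without a server; the real application would issue equivalent statements through psycopg2. The table name host_item_counts, the index name, and the function names are assumptions, not part of the original schema.

```python
import sqlite3


def refresh_host_counts(conn):
    """Rebuild the precomputed per-host counts inside one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM host_item_counts")
        conn.execute(
            """
            INSERT INTO host_item_counts (host, name, ct)
            SELECT h.id, h.name, COUNT(hi.id)
            FROM hosts AS h
            LEFT JOIN host_item AS hi ON h.id = hi.host
            GROUP BY h.id, h.name
            """
        )


def top_hosts(conn, limit=10):
    """Read the report from the summary table; the sort can use the index on ct."""
    return conn.execute(
        "SELECT host, name, ct FROM host_item_counts "
        "ORDER BY ct DESC, host LIMIT ?",
        (limit,),
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE items (id INTEGER PRIMARY KEY, description TEXT);
        CREATE TABLE host_item (
            id INTEGER PRIMARY KEY,
            host INTEGER REFERENCES hosts(id),
            item INTEGER REFERENCES items(id)
        );
        -- Hypothetical summary table, refreshed by the periodic job.
        CREATE TABLE host_item_counts (
            host INTEGER PRIMARY KEY, name TEXT, ct INTEGER
        );
        CREATE INDEX idx_host_item_counts_ct ON host_item_counts (ct);
        INSERT INTO hosts VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
        INSERT INTO items VALUES (1, 'disk'), (2, 'nic');
        INSERT INTO host_item (host, item) VALUES (1, 1), (1, 2), (2, 1);
        """
    )
    refresh_host_counts(conn)
    return top_hosts(conn)
```

The web request only ever calls something like top_hosts, so its cost no longer depends on the size of host_item; the price is that the report can be up to one refresh interval out of date.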

I use PostgreSQL 9.2.

Update: Query plans, as requested 🙂

EXPLAIN ANALYZE
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
ORDER BY count DESC
LIMIT 10;


 Limit  (cost=699028.97..699028.99 rows=10 width=21) (actual time=5427.422..5427.424 rows=10 loops=1)
   ->  Sort  (cost=699028.97..699166.44 rows=54990 width=21) (actual time=5427.415..5427.416 rows=10 loops=1)
         Sort Key: (count(hi.id))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=613177.95..697840.66 rows=54990 width=21) (actual time=3317.320..5416.440 rows=54990 loops=1)
               ->  Merge Left Join  (cost=613177.95..679024.94 rows=3653163 width=21) (actual time=3317.267..5025.999 rows=3653163 loops=1)
                     Merge Cond: (h.id = hi.host)
                     ->  Index Scan using hosts_pkey on hosts h  (cost=0.00..1779.16 rows=54990 width=17) (actual time=0.012..15.693 rows=54990 loops=1)
                     ->  Materialize  (cost=613177.95..631443.77 rows=3653163 width=8) (actual time=3317.245..4370.865 rows=3653163 loops=1)
                           ->  Sort  (cost=613177.95..622310.86 rows=3653163 width=8) (actual time=3317.199..3975.417 rows=3653163 loops=1)
                                 Sort Key: hi.host
                                 Sort Method: external merge  Disk: 64288kB
                                 ->  Seq Scan on host_item hi  (cost=0.00..65124.63 rows=3653163 width=8) (actual time=0.006..643.257 rows=3653163 loops=1)
 Total runtime: 5438.248 ms

For comparison, the same query without the ORDER BY clause:

EXPLAIN ANALYZE
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
LIMIT 10;


 Limit  (cost=0.00..417.03 rows=10 width=21) (actual time=0.136..0.849 rows=10 loops=1)
   ->  GroupAggregate  (cost=0.00..2293261.13 rows=54990 width=21) (actual time=0.134..0.845 rows=10 loops=1)
         ->  Merge Left Join  (cost=0.00..2274445.41 rows=3653163 width=21) (actual time=0.040..0.704 rows=581 loops=1)
               Merge Cond: (h.id = hi.host)
               ->  Index Scan using hosts_pkey on hosts h  (cost=0.00..1779.16 rows=54990 width=17) (actual time=0.015..0.021 rows=11 loops=1)
               ->  Index Scan Backward using idx_host_item_host on host_item hi  (cost=0.00..2226864.24 rows=3653163 width=8) (actual time=0.005..0.438 rows=581 loops=1)
 Total runtime: 1.143 ms

Update: All the answers to this question are really good for learning and understanding how Postgres works. There does not seem to be any definitive solution to this problem, but I really appreciate all the excellent answers you have provided, and I will use them in my future work with PostgreSQL. Thanks a lot, guys!

1 Answer

Editorial Team · Answer 1 · 2026-06-13T01:50:16+00:00

@Gordon and @willglynn have provided a lot of useful background as to why your query is slow.

A workaround would be to add a counter column to the tables items and hosts, plus triggers that keep the counters up to date, at a non-trivial cost to write operations.
Or use a materialized view (a periodically refreshed summary table) like you already do. I might opt for that.
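
The counter-plus-trigger idea might look roughly like the sketch below. It is written against Python's built-in sqlite3 so that it runs without a server; PostgreSQL would need a PL/pgSQL trigger function attached with CREATE TRIGGER instead of SQLite's inline trigger bodies. The column name item_count and the trigger and index names are assumptions, and only the hosts side is shown (items would be symmetric).

```python
import sqlite3

SCHEMA = """
CREATE TABLE hosts (
    id INTEGER PRIMARY KEY,
    name TEXT,
    item_count INTEGER NOT NULL DEFAULT 0  -- hypothetical counter column
);
CREATE TABLE host_item (
    id INTEGER PRIMARY KEY,
    host INTEGER REFERENCES hosts(id),
    item INTEGER
);
-- With a stored counter, an ordinary index can serve the ORDER BY.
CREATE INDEX idx_hosts_item_count ON hosts (item_count);

CREATE TRIGGER host_item_ins AFTER INSERT ON host_item
BEGIN
    UPDATE hosts SET item_count = item_count + 1 WHERE id = NEW.host;
END;

CREATE TRIGGER host_item_del AFTER DELETE ON host_item
BEGIN
    UPDATE hosts SET item_count = item_count - 1 WHERE id = OLD.host;
END;

CREATE TRIGGER host_item_upd AFTER UPDATE OF host ON host_item
BEGIN
    UPDATE hosts SET item_count = item_count - 1 WHERE id = OLD.host;
    UPDATE hosts SET item_count = item_count + 1 WHERE id = NEW.host;
END;
"""


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO hosts (id, name) VALUES (?, ?)",
        [(1, "alpha"), (2, "beta")],
    )
    # Each inserted row fires host_item_ins and bumps the owning host's counter.
    conn.executemany(
        "INSERT INTO host_item (host, item) VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 10)],
    )
    # Deleting beta's only row fires host_item_del and brings it back to 0.
    conn.execute("DELETE FROM host_item WHERE host = 2")
    return conn.execute(
        "SELECT id, name, item_count FROM hosts "
        "ORDER BY item_count DESC, id LIMIT 10"
    ).fetchall()
```

Every write to host_item now also updates a row in hosts, which is the write cost mentioned above; in exchange the report becomes a plain indexed read.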

For that, you still need to execute these queries on a regular basis and they can be improved. Rewrite your first one to:

SELECT id, i.description, hi.ct
FROM   items i
JOIN  (
    SELECT item AS id, count(*) AS ct
    FROM   host_item
    GROUP  BY item
    ORDER  BY ct DESC
    LIMIT  10
    ) hi USING (id);

– If there is a row in table items for most rows in table host_item, it is faster to aggregate first and then JOIN. Contrary to what @willglynn speculates, this is not optimized automatically in Postgres 9.1.
– count(*) is faster than count(col) on principle, and equivalent as long as col cannot be NULL. (A LEFT JOIN might introduce NULL values.)
– Simplified LEFT JOIN to JOIN. It should be safe to assume that there are always at least ten distinct hosts. Doesn’t matter much for your original query, but it’s a requirement for this one.
– Indexes on table host_item won’t help, and the PK on items covers the rest.
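
To make the notes above concrete, here is a small check on toy data, run through Python's built-in sqlite3 as a stand-in for PostgreSQL (so it says nothing about speed, only about results). The sample rows, the LIMIT of 2, the id tie-breakers and the outer ORDER BY are assumptions added so that the comparison is deterministic; they are not part of the queries as posted.

```python
import sqlite3

# Original shape: join first, then aggregate and sort.
ORIGINAL = """
SELECT i.id, i.description, COUNT(hi.id) AS ct
FROM items AS i
LEFT JOIN host_item AS hi ON i.id = hi.item
GROUP BY i.id, i.description
ORDER BY ct DESC, i.id
LIMIT 2
"""

# Rewritten shape: aggregate and limit on host_item alone, then join.
REWRITTEN = """
SELECT i.id, i.description, hi.ct
FROM items AS i
JOIN (
    SELECT item AS id, COUNT(*) AS ct
    FROM host_item
    GROUP BY item
    ORDER BY ct DESC, item
    LIMIT 2
) AS hi ON hi.id = i.id
ORDER BY hi.ct DESC, i.id
"""

# COUNT(*) counts the NULL-extended row that a LEFT JOIN produces for an
# item without hosts; COUNT(hi.id) skips it.
STAR_VS_COLUMN = """
SELECT i.id, COUNT(*) AS star, COUNT(hi.id) AS col
FROM items AS i
LEFT JOIN host_item AS hi ON i.id = hi.item
GROUP BY i.id
ORDER BY i.id
"""


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE items (id INTEGER PRIMARY KEY, description TEXT);
        CREATE TABLE host_item (
            id INTEGER PRIMARY KEY, host INTEGER, item INTEGER
        );
        INSERT INTO items VALUES (1, 'disk'), (2, 'nic'), (3, 'unused');
        INSERT INTO host_item (host, item)
        VALUES (1, 1), (2, 1), (3, 1), (1, 2);
        """
    )
    original = conn.execute(ORIGINAL).fetchall()
    rewritten = conn.execute(REWRITTEN).fetchall()
    counts = conn.execute(STAR_VS_COLUMN).fetchall()
    return original, rewritten, counts
```

Both forms return the same top rows here because at least two items actually appear in host_item. With fewer referenced items than the LIMIT, the inner JOIN version would return fewer rows than the LEFT JOIN version, which is why the simplification needs the assumption stated above.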

Probably still not good enough for your case, but in my tests with Postgres 9.1 this form is more than twice as fast. Should translate to 9.2, but test with EXPLAIN ANALYZE to be sure.
