I am having trouble with a slow query in Postgres that retrieves the latest prices of the catalogproducts whose buy is greater than their sell. The table is rather large at this point, over 2 million rows; I keep it for historical purposes. What I am currently using is:
select * from ta_price a
join (
select catalogproduct_id, max(timestamp) ts
from ta_price
group by catalogproduct_id
) b on a.catalogproduct_id = b.catalogproduct_id
and a.timestamp = b.ts
AND buy > sell;
catalogproduct_id is a Foreign Key to catalogproduct table.
Out of 2201760 total rows, it selects 2296 rows. The total runtime is 181,792.705 ms.
Any insight on how to improve this?
Edit:
I am blown away by all the answers! I also want to qualify this question further within the realm of the Django ORM. I am struggling to incorporate a composite key (or the like) on this table (using catalogproduct_id and timestamp). I have a primary key that is an autoincrementing index, which I guess is as good as having none at all.
Edit 2:
After adding a partial index that @Erwin suggested,
CREATE INDEX my_partial_idx ON ta_price (catalogproduct_id, timestamp) WHERE buy > sell;
I am using the query from @wildplasser, which brings the query time to around 10-12 seconds. For further clarification, my table holds snapshots of prices (buy and sell) of products over time. At any given time, I want to know which products currently (as of their latest snapshot time) have a buy > sell.
Revised answer after some consideration
buy and sell are not qualified in your question. Depending on the selectivity of buy > sell, you can speed up the query by adding the same WHERE clause to the subselect. However, this yields different results. I add it on the off chance that you might have overlooked it.
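As a rough sketch of that variant, assuming it is simply the query from the question with the predicate repeated inside the subselect (not necessarily the exact statement originally posted):

```sql
SELECT a.*
FROM   ta_price a
JOIN  (
   SELECT catalogproduct_id, max(timestamp) AS ts
   FROM   ta_price
   WHERE  buy > sell              -- extra predicate: fewer rows to aggregate
   GROUP  BY catalogproduct_id
   ) b ON a.catalogproduct_id = b.catalogproduct_id
      AND a.timestamp = b.ts
WHERE  a.buy > a.sell;            -- same filter as in the original query
```

The results differ because this picks, per product, the latest row among those with buy > sell, whereas the original picks the latest row overall and keeps it only if it has buy > sell.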
Either way, a simple index like @Will implies will help:
CREATE INDEX my_idx ON ta_price (catalogproduct_id, timestamp);
There is a superior approach, though.
An unconditional max() in the subselect will result in a sequential table scan regardless of indexes. Such an operation will never be fast with 2.2m rows.
The JOIN condition, combined with the WHERE clause of the outer SELECT, will profit from an index like the one above. Depending on the selectivity of buy > sell, a partial index will be a little or substantially faster and, correspondingly, smaller on disk and in RAM:
CREATE INDEX my_partial_idx ON ta_price (catalogproduct_id, timestamp) WHERE buy > sell;
The order of the columns in the index does not matter in this case. It will also speed up my second variant of the query.
You mentioned the table was for “historic” purposes? If that means no new data, you could speed things up greatly with a materialized view.
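A minimal sketch of that idea, assuming PostgreSQL 9.3 or later (where CREATE MATERIALIZED VIEW is available); the view name ta_price_latest is made up for illustration:

```sql
-- Precompute the latest snapshot per product where buy > sell.
-- ta_price_latest is a hypothetical name, not from the question.
CREATE MATERIALIZED VIEW ta_price_latest AS
SELECT a.*
FROM   ta_price a
JOIN  (
   SELECT catalogproduct_id, max(timestamp) AS ts
   FROM   ta_price
   GROUP  BY catalogproduct_id
   ) b ON a.catalogproduct_id = b.catalogproduct_id
      AND a.timestamp = b.ts
WHERE  a.buy > a.sell;

-- Re-run the stored query whenever the underlying data has changed.
REFRESH MATERIALIZED VIEW ta_price_latest;
```

On older versions such as 8.4, a plain table filled with CREATE TABLE ... AS SELECT and rebuilt on demand serves the same purpose.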
On a side note: I would not use timestamp as a column name. It is allowed in PostgreSQL, but it’s a reserved word in all SQL standards.
OK, first things last: for a table of 2.2m rows you need way more resources than postgres has out of the box. shared_buffers and work_mem for a start.
Increase these statistics settings:
ALTER TABLE tmp.ta_price ALTER COLUMN buy SET STATISTICS 1000;
ALTER TABLE tmp.ta_price ALTER COLUMN sell SET STATISTICS 1000;
ALTER TABLE tmp.ta_price ALTER COLUMN ts SET STATISTICS 1000;
Then run
ANALYZE tmp.ta_price;
Be sure that autovacuum is running. If in doubt, run
VACUUM ANALYZE ta_price
and see if it had an effect.
I have played with the test setup of wildplasser (which was very helpful!) on a pg 8.4 installation with limited resources.
Here are the total runtimes from EXPLAIN ANALYZE:
Variant 2 with the additional (buy > sell) clause:
With partial index:
Probably planner costs are off; this test db cluster is optimized for the main db, which has way more resources.
Resumé
A.H.’s version takes much longer (same result as you reported). Window functions tend to be slow, especially on older versions of postgres. My alternative query is twice as fast, as expected. The question is whether the different results are desired; maybe not.
Anyway, that was with 300k rows. The query takes 0.5 – 1s on version 8.4 with limited resources (but proper settings, mostly) on a 5 year old server. With a decent machine and decent settings (enough RAM!) you should bring it down to under 10s at least.
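For the settings mentioned above, a small sketch of how to inspect and adjust them; the values shown are purely illustrative assumptions, not recommendations for this particular machine:

```sql
-- Inspect the current values.
SHOW shared_buffers;
SHOW work_mem;

-- work_mem can be raised for the current session only.
-- '64MB' is an illustrative value.
SET work_mem = '64MB';

-- shared_buffers cannot be changed per session: it is set in
-- postgresql.conf and needs a server restart, for example
--   shared_buffers = 1GB   (illustrative value)
```

Raising work_mem per session is a low-risk way to test its effect on a single query before changing the server-wide configuration.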