I need to quickly select a value (baz) from the “earliest” (MIN(save_date)) rows, grouped by their foo_id. The following query returns the correct rows (well, almost: it can return multiple rows for a foo_id when there are duplicate save_dates).
The foos table contains about 55k rows and the samples table contains about 25 million rows.
CREATE TABLE foos (
    foo_id int,
    val varchar(40),
    -- ref_id is a FK, constraint omitted for brevity
    ref_id int
);
CREATE TABLE samples (
    sample_id int,
    save_date date,
    baz smallint,
    -- foo_id is a FK, constraint omitted for brevity
    foo_id int
);
WITH foo ( foo_id, val ) AS (
    SELECT foo_id, val FROM foos
    WHERE foos.ref_id = 1
    ORDER BY foos.val ASC
    LIMIT 25 OFFSET 0
)
SELECT foo.val, firsts.baz
FROM foo
LEFT JOIN (
    SELECT A.baz, A.foo_id
    FROM samples A
    INNER JOIN (
        SELECT foo_id, MIN( save_date ) AS save_date
        FROM samples
        GROUP BY foo_id
    ) B
    USING ( foo_id, save_date )
) firsts USING ( foo_id );
This query currently takes over 100 seconds; I’d like to see this return in ~1 second (or less!).
How can I write this query to be optimal?
Updated; adding explains:
Obviously the actual query I’m using isn’t using tables foo, baz, etc.
Here is the EXPLAIN output for the “dumbed down” example query above:
Hash Right Join  (cost=337.69..635.47 rows=3 width=100)
  Hash Cond: (a.foo_id = foo.foo_id)
  CTE foo
    ->  Limit  (cost=71.52..71.53 rows=3 width=102)
          ->  Sort  (cost=71.52..71.53 rows=3 width=102)
                Sort Key: foos.val
                ->  Seq Scan on foos  (cost=0.00..71.50 rows=3 width=102)
                      Filter: (ref_id = 1)
  ->  Hash Join  (cost=265.25..562.90 rows=9 width=6)
        Hash Cond: ((a.foo_id = samples.foo_id) AND (a.save_date = (min(samples.save_date))))
        ->  Seq Scan on samples a  (cost=0.00..195.00 rows=1850 width=10)
        ->  Hash  (cost=244.25..244.25 rows=200 width=8)
              ->  HashAggregate  (cost=204.25..224.25 rows=200 width=8)
                    ->  Seq Scan on samples  (cost=0.00..195.00 rows=1850 width=8)
  ->  Hash  (cost=0.60..0.60 rows=3 width=102)
        ->  CTE Scan on foo  (cost=0.00..0.60 rows=3 width=102)
If I understand the question, you want windowing.
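A minimal sketch of what a window-function version might look like, assuming the samples table from the question and a PostgreSQL version with window functions (8.4 or later). The aliases ranked and rn are placeholder names chosen for illustration, and this is one possible shape of the query, not necessarily the exact one intended here.

```sql
-- Sketch only: number each foo_id's samples from earliest to latest,
-- then keep the row numbered 1 for every foo_id.
SELECT foo_id, baz
FROM (
    SELECT foo_id,
           baz,
           -- rank() in place of row_number() would give tied save_dates
           -- the same number, so ties would all come back with rn = 1.
           row_number() OVER (PARTITION BY foo_id ORDER BY save_date) AS rn
    FROM samples
) ranked
WHERE rn = 1;
```

When two samples share the earliest save_date, which of them gets rn = 1 is not defined by this ORDER BY; adding sample_id as a second sort key would make the choice repeatable, if that matters.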
Using row_number instead of rank eliminates duplicates and guarantees only one baz per foo. If you need to know about foos that have no bazzes, just LEFT JOIN the foos table to this query.
With an index on (foo_id, save_date), the optimizer should be smart enough to do the grouping keeping only one baz and skipping merrily along.
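Putting those two suggestions together might look roughly like the following. This is a sketch under assumptions: the index name is arbitrary, the foo CTE is copied from the question, and the IN filter inside the subquery is my own addition to keep the window from running over all 25 million rows. Whether the planner actually uses the index this way is worth checking with EXPLAIN ANALYZE on the real data.

```sql
-- Composite index as suggested above; the index name is an arbitrary choice.
CREATE INDEX samples_foo_id_save_date_idx ON samples (foo_id, save_date);

WITH foo (foo_id, val) AS (
    SELECT foo_id, val
    FROM foos
    WHERE ref_id = 1
    ORDER BY val ASC
    LIMIT 25 OFFSET 0
)
SELECT foo.val, firsts.baz
FROM foo
LEFT JOIN (
    SELECT foo_id,
           baz,
           row_number() OVER (PARTITION BY foo_id ORDER BY save_date) AS rn
    FROM samples
    -- Assumption: restrict to the 25 selected foos before windowing.
    WHERE foo_id IN (SELECT foo_id FROM foo)
) firsts ON firsts.foo_id = foo.foo_id AND firsts.rn = 1
ORDER BY foo.val;
```

Because the rn = 1 condition sits in the ON clause and not in a WHERE clause, foos with no samples still appear, with a NULL baz.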