I have two tables:
CREATE TABLE items
(
    root_id integer NOT NULL,
    id serial NOT NULL,
    -- Other fields...
    CONSTRAINT items_pkey PRIMARY KEY (root_id, id)
);
CREATE TABLE votes
(
    root_id integer NOT NULL,
    item_id integer NOT NULL,
    user_id integer NOT NULL,
    type smallint NOT NULL,
    direction smallint,
    CONSTRAINT votes_pkey PRIMARY KEY (root_id, item_id, user_id, type),
    CONSTRAINT votes_root_id_fkey FOREIGN KEY (root_id, item_id)
        REFERENCES items (root_id, id) MATCH SIMPLE
        ON UPDATE CASCADE ON DELETE CASCADE,
    -- Other constraints...
);
I’m trying to pull out, in a single query, all items with a particular root_id, along with a few arrays of the user_ids of the users who voted in particular ways. The following query does what I need:
SELECT *,
    ARRAY(SELECT user_id FROM votes WHERE root_id = i.root_id AND item_id = i.id AND type = 0 AND direction = 1) AS upvoters,
    ARRAY(SELECT user_id FROM votes WHERE root_id = i.root_id AND item_id = i.id AND type = 0 AND direction = -1) AS downvoters,
    ARRAY(SELECT user_id FROM votes WHERE root_id = i.root_id AND item_id = i.id AND type = 1) AS favoriters
FROM items i
WHERE root_id = 1
ORDER BY id;
The problem is that I’m using three subqueries to get the information I need when it seems like I should be able to do the same in one. I thought that Postgres (I’m using 8.4) might be smart enough to collapse them all into a single query for me, but looking at the explain output in pgAdmin it looks like that’s not happening – it’s running multiple primary key lookups on the votes table instead. I feel like I could rework this query to be more efficient, but I’m not sure how.
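For reference, the same kind of plan can be inspected without pgAdmin. This is a minimal sketch, trimmed to one of the three subqueries to keep the output short; it assumes nothing beyond the two tables above:

```sql
-- EXPLAIN ANALYZE executes the query and reports the actual plan and timings.
-- Each ARRAY(...) subquery is expected to appear as its own SubPlan node.
EXPLAIN ANALYZE
SELECT i.id,
       ARRAY(SELECT user_id
             FROM votes
             WHERE root_id = i.root_id
               AND item_id = i.id
               AND type = 0
               AND direction = 1) AS upvoters
FROM items i
WHERE i.root_id = 1
ORDER BY i.id;
```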
Any pointers?
EDIT: An update to explain where I am now. On the advice of the pgsql-general mailing list, I tried changing the query to use a CTE:
WITH v AS (
    SELECT item_id, type, direction, array_agg(user_id) AS user_ids
    FROM votes
    WHERE root_id = 5305
    GROUP BY type, direction, item_id
    ORDER BY type, direction, item_id
)
SELECT *,
    (SELECT user_ids FROM v WHERE item_id = i.id AND type = 0 AND direction = 1) AS upvoters,
    (SELECT user_ids FROM v WHERE item_id = i.id AND type = 0 AND direction = -1) AS downvoters,
    (SELECT user_ids FROM v WHERE item_id = i.id AND type = 1) AS favoriters
FROM items i
WHERE root_id = 5305
ORDER BY id;
Benchmarking each of these from my application (I set up each as a prepared statement to avoid spending time on query planning, and then ran each one several thousand times with a variety of root_ids) my initial approach averages 15 milliseconds and the CTE approach averages 17 milliseconds. I was able to repeat this result over a few runs.
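The benchmark itself was run from the application, but the prepared-statement setup can be sketched in plain SQL roughly as follows. The statement name items_with_upvoters is hypothetical, and the query is trimmed to a single array for brevity:

```sql
-- Prepare once; $1 stands for the root_id supplied on each execution.
PREPARE items_with_upvoters (integer) AS
SELECT i.id,
       ARRAY(SELECT user_id
             FROM votes
             WHERE root_id = i.root_id
               AND item_id = i.id
               AND type = 0
               AND direction = 1) AS upvoters
FROM items i
WHERE i.root_id = $1
ORDER BY i.id;

-- Run it repeatedly with different root_ids.
EXECUTE items_with_upvoters(5305);
EXECUTE items_with_upvoters(1);

-- Clean up when done (prepared statements also vanish when the session ends).
DEALLOCATE items_with_upvoters;
```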
When I have some time I’m going to play with jkebinger’s and Dragontamer5788’s approaches with my test data and see how they work, but I’m also starting a bounty to see if I can get more suggestions.
I should also mention that I’m open to changing my schema (the system isn’t in production yet, and won’t be for a couple months) if it can speed up this query. I designed my votes table this way to take advantage of the primary key’s uniqueness constraint – a given user can both favorite and upvote an item, for example, but not upvote it AND downvote it – but I can relax/work around that constraint if representing these options in a different way makes more sense.
EDIT #2: I’ve benchmarked all four solutions. Amazingly, Sequel is flexible enough that I was able to write all four without dropping to SQL once (not even for the CASE statements). As before, I ran them all as prepared statements, so that query planning time wouldn’t be an issue, and ran each one several thousand times. I tested all the queries in two situations: a worst-case scenario with a lot of rows (265 items and 4911 votes), where the relevant rows would be in the cache pretty quickly and CPU usage should be the deciding factor, and a more realistic scenario where a random root_id was chosen for each run. I wound up with:
Original query - Typical: ~10.5 ms, Worst case: ~26 ms
CTE query - Typical: ~16.5 ms, Worst case: ~70 ms
Dragontamer5788 - Typical: ~15 ms, Worst case: ~36 ms
jkebinger - Typical: ~42 ms, Worst case: ~180 ms
I suppose the lesson to take from this right now is that Postgres’ query planner is very smart and is probably doing something clever under the surface. I don’t think I’m going to spend any more time trying to work around it. If anyone would like to submit another query attempt I’d be happy to benchmark it, but otherwise I think Dragontamer is the winner of the bounty and correct (or closest to correct) answer. Unless someone else can shed some light on what Postgres is doing – that would be pretty cool. 🙂
There are two questions being asked:
For #1, I can’t get the “complete” thing into a single Common Table Expression, because you’re using a correlated subquery on each item. Still, you might have some benefits if you used a common table expression. Obviously, this will depend on the data, so please benchmark to see if it would help.
For #2, because there are three commonly accessed “classes” of items in your table, I expect partial indexes to increase the speed of your query, regardless of whether or not you were able to increase the speed due to #1.
First, the easy stuff then. To add a partial index to this table, I’d do:
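A minimal sketch of what such partial indexes might look like, one per class of vote used in the question's query. The index names and the choice of indexed columns are assumptions, so adjust them to your data:

```sql
-- One partial index per "class" of vote; each covers only the matching rows.
CREATE INDEX votes_upvotes_idx
    ON votes (root_id, item_id)
    WHERE type = 0 AND direction = 1;

CREATE INDEX votes_downvotes_idx
    ON votes (root_id, item_id)
    WHERE type = 0 AND direction = -1;

CREATE INDEX votes_favorites_idx
    ON votes (root_id, item_id)
    WHERE type = 1;
```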
The smaller these indexes, the more efficient your queries will be. Unfortunately, in my tests, they didn’t seem to help 🙁 Still, you may find a use for them; it depends greatly on your data.
As for an overall optimization, I’d approach the problem differently. I’d “unroll” the query into this form (using an inner join and using conditional expressions to “split up” the three types of votes), and then use “Group By” and the “array” aggregate operator to combine them. IMO, I’d rather change my application code to accept it in the “unrolled” form, but if you can’t change the application code, then the “group by”+aggregate function ought to work.
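A rough sketch of that "group by" + aggregate form, assuming the vote_type labels 'upvote', 'downvote' and 'favorite' (hypothetical names) and selecting only the item id rather than every column of items. Note that the inner join leaves out items that have no votes at all:

```sql
-- One output row per (item, vote_type), with the voters collected into an array.
SELECT i.id AS item_id,
       CASE
           WHEN v.type = 0 AND v.direction = 1  THEN 'upvote'
           WHEN v.type = 0 AND v.direction = -1 THEN 'downvote'
           WHEN v.type = 1                      THEN 'favorite'
       END AS vote_type,
       array_agg(v.user_id) AS user_ids
FROM items i
INNER JOIN votes v
        ON v.root_id = i.root_id
       AND v.item_id = i.id
WHERE i.root_id = 1
  -- Keep only the three classes of vote, so vote_type is never NULL.
  AND (v.type = 1 OR (v.type = 0 AND v.direction IN (1, -1)))
-- vote_type here refers to the CASE output column above.
GROUP BY i.id, vote_type
ORDER BY i.id, vote_type;
```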
It’s still “one step unrolled” compared to your code (vote_type is vertical, while in your case it’s horizontal, across the columns). But this seems to be more efficient.