I have two tables:
urls (table with indexed pages, host is an indexed column, 30 million rows)
hosts (table with information about hosts, host is an indexed column, 1 million rows)
One of the most frequent SELECT queries in my application is:
SELECT urls.* FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = ?
AND hosts.is_spam IS NULL
ORDER BY urls.id DESC LIMIT ?
In projects which have more than 100,000 rows in the urls table, the query executes very slowly.
As the tables have grown, the query has become slower and slower. I've read a lot about NoSQL databases (like MongoDB), which are designed to handle such big tables, but changing my database from PgSQL to MongoDB would be a big undertaking for me. Right now I would like to try to optimize the PgSQL solution. Do you have any advice? What should I do?
Add an index on the hosts.host column (primarily in the hosts table, this matters) and a composite index on urls.projects_id, urls.id, then run the ANALYZE statement to update all statistics, and observe subsecond performance regardless of spam percentage.

Slightly different advice would apply if almost everything is always spam and if the "projects", whatever they are, are few in number and very big each.
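The indexes and the statistics refresh suggested above could look roughly like this. This is only a sketch: the index names are made up for illustration, and since the schema was not shown, the first statement may be unnecessary if hosts.host is already a primary key or has a unique constraint.

```sql
-- Index for the join lookups into hosts.
-- (Index name is hypothetical; skip this if host is already a primary key or unique.)
CREATE INDEX IF NOT EXISTS hosts_host_idx ON hosts (host);

-- Composite index: first column filters by project,
-- second column delivers the matching rows already ordered by id.
-- (Index name is hypothetical.)
CREATE INDEX IF NOT EXISTS urls_projects_id_id_idx ON urls (projects_id, id);

-- Refresh planner statistics for both tables.
ANALYZE urls;
ANALYZE hosts;
```

On a live system, CREATE INDEX CONCURRENTLY builds the index without blocking writes to the table, at the cost of a slower build.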
Explanation: updating the statistics makes it possible for the optimizer to recognize that the urls and hosts tables are both quite big (well, you didn't show us the schema, so we don't know your row sizes). The composite index starting with projects_id will hopefully (1) rule out most of the urls content, and its second component will immediately feed the rest of urls in the desired order, so it is quite likely that an index scan of urls will be the basis for the query plan chosen by the planner. It is then essential to have an index on hosts.host to make the hosts lookups efficient; the majority of this big table will never be accessed at all.

1) Here is where we assume that projects_id is reasonably selective (that it is not the same value throughout the whole table).
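To check whether the planner actually picks the plan described here, the query can be run under EXPLAIN. The project id and the limit below are placeholder values chosen for illustration, not taken from the question.

```sql
-- Runs the query and reports the real plan, timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT urls.*
FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = 42      -- hypothetical project id
  AND hosts.is_spam IS NULL
ORDER BY urls.id DESC
LIMIT 50;                        -- hypothetical page size
```

If things go as expected, the output should show a backward index scan on urls using the composite index, with index lookups into hosts, and no separate sort step. A sequential scan or an explicit sort on urls would suggest that the composite index is not being used, for example because projects_id is not selective enough.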