I have two tables in PostgreSQL:
urls (table of indexed pages; host is an indexed column; 30 million rows)
hosts (table with information about hosts; host is an indexed column; 1 million rows)
One of the most frequent SELECT queries in my application is:
SELECT urls.*
FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = ?
AND hosts.is_spam IS NULL
ORDER BY urls.id DESC LIMIT ?
In projects that have more than 100,000 rows in the urls table, the query executes very slowly.
As the tables have grown, the query has become slower and slower. I’ve read a lot about NoSQL databases (like MongoDB), which are designed to handle such big tables, and I’m considering moving my data to MongoDB. Everything would be easy if I didn’t have to check the hosts table while selecting data from the urls table. I’ve heard that MongoDB doesn’t support joins, so my question is: how do I solve the above problem? I could put the host information in the urls collection, but the hosts.is_spam field can be updated by the user, and I would then have to update the whole urls collection. I don’t know if that is the right solution.
I would be grateful for any advice.
You are correct that the problem is the join, but my guess is that it’s just the wrong kind of join. As Frank H. mentioned, PostgreSQL should be able to process this type of query rather handily, depending on the frequency of hosts.is_spam. You probably want to cluster the urls table on id to optimize the ORDER BY ... LIMIT phase. Since you only care about urls.*, you can minimize disk I/O by creating a partial index on hosts.host where is_spam is not null, to make it easy to get just the short list of hosts to avoid.
Try this:
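A minimal sketch of that setup, assuming urls has a primary-key index named urls_pkey and using a hypothetical index name, hosts_spam_host_idx. The ? placeholders are the bind parameters from the question, and the query is one possible way to write the exclusion, not necessarily the only one:

```sql
-- Assumption: the primary-key index on urls.id is named urls_pkey.
-- Physically reorders urls by id so the ORDER BY ... LIMIT scan reads fewer pages.
CLUSTER urls USING urls_pkey;

-- Hypothetical index name. The partial index covers only hosts flagged as spam,
-- so it stays small when most hosts have is_spam set to NULL.
CREATE INDEX hosts_spam_host_idx
    ON hosts (host)
    WHERE is_spam IS NOT NULL;

-- Keep only urls whose host has no matching row among the flagged hosts.
SELECT urls.*
FROM urls
WHERE urls.projects_id = ?
  AND NOT EXISTS (
      SELECT 1
      FROM hosts
      WHERE hosts.host = urls.host
        AND hosts.is_spam IS NOT NULL
  )
ORDER BY urls.id DESC
LIMIT ?;
```

Note that CLUSTER rewrites the table under an exclusive lock and is a one-time operation: rows inserted afterwards are not kept in index order, so it would need to be repeated periodically.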
Or this:
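The same exclusion can also be sketched as a LEFT JOIN that keeps only the rows with no spam match. This is an illustrative form that uses the table and column names from the question; whether it plans better than a NOT EXISTS subquery on your data is something to check with EXPLAIN ANALYZE:

```sql
-- Join only against hosts flagged as spam, then keep the urls
-- for which no such host row was found.
SELECT urls.*
FROM urls
LEFT JOIN hosts
       ON hosts.host = urls.host
      AND hosts.is_spam IS NOT NULL
WHERE urls.projects_id = ?
  AND hosts.host IS NULL
ORDER BY urls.id DESC
LIMIT ?;
```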
This will allow PostgreSQL to use an anti-join to pull only the urls that are not mapped to a known spammy host. The results may differ from your query if there are urls with a null or invalid host: your inner join drops those rows, whereas the anti-join keeps them.