I have two tables in PostgreSQL: urls (table with indexed pages, host is indexed

Question

Asked: June 7, 2026 (2026-06-07T09:51:11+00:00)


I have two tables in PostgreSQL:
urls (table of indexed pages; host is an indexed column; 30 million rows)
hosts (table with information about hosts; host is an indexed column; 1 million rows)

One of the most frequent SELECT queries in my application is:

SELECT urls.* 
FROM urls 
JOIN hosts ON urls.host = hosts.host 
WHERE urls.projects_id = ? 
  AND hosts.is_spam IS NULL 
ORDER BY urls.id DESC
LIMIT ?

In projects with more than 100,000 rows in the urls table, the query executes very slowly.

As the tables have grown, the query has become slower and slower. I've read a lot about NoSQL databases (like MongoDB), which are designed to handle tables this big, and I'm considering moving my data to MongoDB. Everything would be easy if I didn't have to check the hosts table while selecting data from the urls table. I've heard that MongoDB doesn't support joins, so my question is: how do I solve the problem above? I could put the host information in the urls collection, but the field hosts.is_spam can be updated by the user, and I would then have to update the whole urls collection. I don't know if that is the right solution.

I would be grateful for any advice.

1 Answer

Editorial Team · Answer 1 · 2026-06-07T09:51:13+00:00

You are correct that the problem is the join, but my guess is that it's just the wrong kind of join. As Frank H. mentioned, PostgreSQL should be able to process this type of query rather handily, depending on how often hosts.is_spam is set. You probably want to cluster the urls table on id to optimize the ORDER BY ... LIMIT phase. Since you only care about urls.*, you can minimize disk I/O by creating a partial index on hosts.host where is_spam is not null, which makes it cheap to fetch just the short list of hosts to avoid.
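A rough sketch of those two steps might look like the following. The index name hosts_spam_host_idx is made up for illustration, and urls_pkey assumes that urls.id is the primary key with PostgreSQL's default index name; check the real names with \d urls before running anything.

```sql
-- Partial index: contains only hosts whose is_spam flag is set,
-- so the "hosts to avoid" list can be looked up cheaply.
-- The index name is hypothetical.
CREATE INDEX hosts_spam_host_idx
    ON hosts (host)
    WHERE is_spam IS NOT NULL;

-- Physically reorder urls by id.
-- Assumes urls.id is the primary key and its index has the default
-- name urls_pkey; substitute the real index name.
CLUSTER urls USING urls_pkey;

-- Refresh planner statistics after the rewrite.
ANALYZE urls;
ANALYZE hosts;
```

Note that CLUSTER takes an ACCESS EXCLUSIVE lock on the table while it runs and is a one-time reordering: rows inserted afterwards are not kept in index order, so it has to be repeated periodically if it turns out to help.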

Try this:

select urls.* 
from urls 
left join hosts 
   on urls.host = hosts.host 
   and hosts.is_spam is not null
where urls.projects_id = ? 
and hosts.host is null

Or this:

select * 
from urls
where urls.projects_id = ? 
and not exists (
   select 1
   from hosts
   where hosts.host = urls.host
   and hosts.is_spam is not null
)

Either form lets PostgreSQL use an anti-join to pull only the urls that are not mapped to a known spammy host. The results may differ from your query if there are urls with a null host or a host that has no row in hosts: your inner join drops those rows, while the anti-join keeps them.
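To check whether the planner really picks an anti-join, one option is to run the query under EXPLAIN. The sketch below puts the ORDER BY and LIMIT from the original query back in; the project id 42 and the limit 50 are placeholder values, not taken from the question.

```sql
-- EXPLAIN ANALYZE actually executes the query and reports real timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT urls.*
FROM urls
WHERE urls.projects_id = 42          -- hypothetical project id
  AND NOT EXISTS (
      SELECT 1
      FROM hosts
      WHERE hosts.host = urls.host
        AND hosts.is_spam IS NOT NULL
  )
ORDER BY urls.id DESC
LIMIT 50;                            -- hypothetical page size
```

If the rewrite is doing its job, the plan should contain an "Anti Join" node (for example Hash Anti Join or Nested Loop Anti Join) instead of a plain join against the full hosts table.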
