I created a little web spider in Python which I’m using to collect URLs. I’m not interested in the content. Right now I’m keeping all the visited URLs in a set in memory, because I don’t want my spider to visit URLs twice. Of course that’s a very limited way of accomplishing this.
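Schematically, what I have now amounts to no more than this (a minimal sketch; the real spider does the fetching and parsing around it):

```python
visited = set()

def should_visit(url: str) -> bool:
    """Return True the first time a URL is seen; everything lives in RAM."""
    if url in visited:
        return False
    visited.add(url)
    return True
```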
So what’s the best way to keep track of my visited URLs?
Should I use a database?
- which one? MySQL, SQLite, PostgreSQL?
- how should I save the URLs? As a primary key, trying to insert every URL before visiting it? (I sketch what I mean after this list.)
Or should I write them to a file?
- one file?
- multiple files? How should I design the file structure?
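To make the primary-key idea concrete, here is a minimal sketch of what I mean, using SQLite (the file and table names are just placeholders):

```python
import sqlite3

conn = sqlite3.connect("visited.db")
conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def try_claim(url: str) -> bool:
    """The insert doubles as the membership test: the PRIMARY KEY rejects repeats."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO visited (url) VALUES (?)", (url,))
        return True   # first time this URL is seen, so visit it
    except sqlite3.IntegrityError:
        return False  # already claimed by an earlier visit
```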
I’m sure there are books and a lot of papers on this or similar topics. Can you give me some advice on what I should read?
Those seem to me to be the important aspects.
There are many ways to do this, and it depends on how large your collection of URLs will get. I think an SQL database provides a good model for your problem.
Probably all you need is an SQLite database. A plain string lookup for an existence check is typically a slow operation. To speed it up, you can compute a CRC hash of each URL and store both the CRC and the URL in your database, with an index on the CRC field.
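A minimal sketch of that scheme, assuming the standard-library sqlite3 and zlib modules (file, table, and column names here are placeholders of mine):

```python
import sqlite3
import zlib

conn = sqlite3.connect("seen.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (crc INTEGER, url TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_seen_crc ON seen (crc)")

def crc_of(url: str) -> int:
    return zlib.crc32(url.encode("utf-8"))

def already_visited(url: str) -> bool:
    """Narrow by the indexed CRC first, then compare the few candidate URLs."""
    rows = conn.execute("SELECT url FROM seen WHERE crc = ?", (crc_of(url),))
    return any(row[0] == url for row in rows)

def mark_visited(url: str) -> None:
    with conn:  # commits the insert
        conn.execute("INSERT INTO seen (crc, url) VALUES (?, ?)", (crc_of(url), url))
```

The index means the expensive string comparison only runs for rows that already share the CRC; dropping that final comparison and trusting the CRC alone gives the faster but lossy variant discussed next.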
There is of course a chance of collisions between URL hashes, but if 100% coverage is not important to you, you can take the hit of occasionally skipping a URL because its hash collides with one already in your DB.
You can also reduce collisions in several ways. For example, you can use a wider CRC (CRC64 instead of CRC32) or switch to a hash algorithm with a larger output, or you can key on the CRC together with the URL length.
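As a sketch of that last suggestion (hashlib is in the standard library; the 8-byte truncation is an arbitrary choice of mine):

```python
import hashlib

def url_key(url: str) -> tuple[int, int]:
    """64-bit hash prefix plus URL length; a collision must match both."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big"), len(url)
```

Two URLs now have to agree on both the 64-bit prefix and their length before they collide, which is far less likely than a clash on a 32-bit CRC alone.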