I created a little web spider in Python which I’m using to collect URLs. I’m not interested in the content. Right now I’m keeping all the visited URLs in a set in memory, because I don’t want my spider to visit URLs twice. Of course that’s a very limited way of accomplishing this.
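Schematically, what I have now amounts to no more than this (a minimal sketch; the real spider does the fetching and parsing around it):

```python
visited = set()

def should_visit(url: str) -> bool:
    """Return True the first time a URL is seen; everything lives in RAM."""
    if url in visited:
        return False
    visited.add(url)
    return True
```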
So what’s the best way to keep track of my visited URLs?
Should I use a database?
- which one? MySQL, SQLite, PostgreSQL?
- how should I save the URLs? As a primary key, trying to insert every URL before visiting it? (I sketch what I mean after this list.)
Or should I write them to a file?
- one file?
- multiple files? How should I design the file structure?
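To make the primary-key idea concrete, here is a minimal sketch of what I mean, using SQLite (the file and table names are just placeholders):

```python
import sqlite3

conn = sqlite3.connect("visited.db")
conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def try_claim(url: str) -> bool:
    """The insert doubles as the membership test: the PRIMARY KEY rejects repeats."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO visited (url) VALUES (?)", (url,))
        return True   # first time this URL is seen, so visit it
    except sqlite3.IntegrityError:
        return False  # already claimed by an earlier visit
```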
I’m sure there are books and a lot of papers on this or similar topics. Can you give me some advice on what I should read?
Those seem to me to be the important aspects.
There are many ways to do this, and it depends on how large your collection of URLs will get. I think an SQL database provides a good model for your problem.
Probably all you need is an SQLite database. A plain string lookup for an existence check is typically a slow operation. To speed it up, you can compute a CRC hash of each URL and store both the CRC and the URL in your database, with an index on the CRC field.
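A minimal sketch of that scheme, assuming the standard-library sqlite3 and zlib modules (file, table, and column names here are placeholders of mine):

```python
import sqlite3
import zlib

conn = sqlite3.connect("seen.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (crc INTEGER, url TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_seen_crc ON seen (crc)")

def crc_of(url: str) -> int:
    return zlib.crc32(url.encode("utf-8"))

def already_visited(url: str) -> bool:
    """Narrow by the indexed CRC first, then compare the few candidate URLs."""
    rows = conn.execute("SELECT url FROM seen WHERE crc = ?", (crc_of(url),))
    return any(row[0] == url for row in rows)

def mark_visited(url: str) -> None:
    with conn:  # commits the insert
        conn.execute("INSERT INTO seen (crc, url) VALUES (?, ?)", (crc_of(url), url))
```

The index means the expensive string comparison only runs for rows that already share the CRC; dropping that final comparison and trusting the CRC alone gives the faster but lossy variant discussed next.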
There is of course a chance of collisions between URL hashes, but if 100% coverage is not important to you, you can take the hit of occasionally skipping a URL because its hash collides with one already in your DB.
You can also reduce collisions in several ways. For example, you can use a wider CRC (CRC64 instead of CRC32) or switch to a hash algorithm with a larger output, or you can key on the CRC together with the URL length.
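As a sketch of that last suggestion (hashlib is in the standard library; the 8-byte truncation is an arbitrary choice of mine):

```python
import hashlib

def url_key(url: str) -> tuple[int, int]:
    """64-bit hash prefix plus URL length; a collision must match both."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big"), len(url)
```

Two URLs now have to agree on both the 64-bit prefix and their length before they collide, which is far less likely than a clash on a 32-bit CRC alone.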