My crawler is crawling all websites and getting metadata information from them. I then

Question

Asked: May 23, 2026 (2026-05-23T16:46:52+00:00)

My crawler crawls all websites and collects metadata from them.
I will then run a script to sanitize the URLs and store them in Amazon RDS.

My problem is which datastore I should use to hold the data for sanitization (deleting unwanted URLs). I don't want the crawler to hit Amazon RDS directly, because that would slow it down.

Should I be using Amazon SimpleDB? Then I could read from SimpleDB, sanitize the URLs, and move them to Amazon RDS.


1 Answer

Editorial Team · Answer 1 · 2026-05-23T16:46:53+00:00

You can always use a database, but the issue is disk access. Every time, you would do one disk access to read a batch of URLs, sanitize them, and then write them to another database, which is a second disk access. This process is fine if you aren't concerned about performance.

One solution is to use an in-memory data structure as simple as a list: store a batch of URLs in it, and have a thread that wakes up when the list hits a threshold, cleans up the URLs, and then writes them to Amazon RDS.
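The list-plus-thread idea could be sketched roughly as follows in Python. This is only an illustration under assumptions: `UrlBuffer`, `is_wanted`, the `sink` callable, and the threshold value are hypothetical names that do not come from the question, and the sanitization rule (keep only http/https URLs that have a host) is a placeholder for whatever "unwanted" means in your crawler. The `sink` stands in for a bulk insert into Amazon RDS.

```python
import threading
from urllib.parse import urlsplit


def is_wanted(url):
    """Placeholder sanitization rule (assumption): keep http(s) URLs with a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


class UrlBuffer:
    """In-memory list drained by a background thread once it reaches a threshold."""

    def __init__(self, sink, threshold=100):
        self._sink = sink              # callable taking a list of clean URLs
        self._threshold = threshold
        self._urls = []
        self._closed = False
        self._cond = threading.Condition()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def add(self, url):
        """Called by the crawler; only touches memory, never the database."""
        with self._cond:
            self._urls.append(url)
            if len(self._urls) >= self._threshold:
                self._cond.notify()

    def close(self):
        """Flush whatever is left in the list and stop the worker."""
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._worker.join()

    def _run(self):
        while True:
            with self._cond:
                # Sleep until the list is full enough or we are shutting down.
                self._cond.wait_for(
                    lambda: self._closed or len(self._urls) >= self._threshold
                )
                batch, self._urls = self._urls, []
                done = self._closed
            clean = [u for u in batch if is_wanted(u)]
            if clean:
                self._sink(clean)      # one bulk write per batch, outside the lock
            if done:
                return
```

The crawler would call `buffer.add(url)` for each URL it finds and `buffer.close()` once at shutdown. Because the write happens outside the lock, the crawler can keep appending while a batch is being stored. The trade-off is that anything still in the list is lost if the process crashes, so a smaller threshold limits the exposure at the cost of more frequent writes.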
