I have a web crawler that looks for specific information I want and returns it. This is run daily.
The issue is that my crawler has to do two things.
- Get the link it has to crawl.
- Crawl said link and push stuff to the db.
The issue with the first step is that there are 700+ links in total. These links don’t change VERY frequently, maybe once a month?
So one option is just to do a separate crawl for the ‘list of links’, once a month, and dump the links into the db.
Then, have the crawler do a db hit for each of those 700 links every day.
Or, I can just have a nested crawl within my crawler, where every single time the crawler is run (daily), it updates this list of 700 URLs, stores it in an array, and pulls from this array to crawl each link.
Which is more efficient and would be less taxing on Heroku, or whichever host?
It depends on how you measure “efficiency” and “taxing”, but the local database hit is almost certain to be faster and “better” than an HTTP request + parsing an HTML(?) response for the links.
Further, although it likely doesn’t matter at this size, you can (assuming your database and adapter support it) begin iterating through the query results and processing them as they arrive, without first fetching the entire set into memory.
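As a rough illustration of iterating over results instead of loading them all at once, here is a minimal sketch using Python's built-in `sqlite3` module. The `links` table, its `url` column, and the `iter_links` helper are assumptions made up for the example, not anything from your setup; other databases and adapters expose similar cursor iteration, but check whether yours actually streams rows.

```python
import sqlite3

def iter_links(conn):
    """Yield URLs one at a time by iterating the cursor (no fetchall())."""
    cursor = conn.execute("SELECT url FROM links ORDER BY id")
    for (url,) in cursor:
        yield url

# In-memory database standing in for the real one (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO links (url) VALUES (?)",
    [(f"https://example.com/page/{n}",) for n in range(700)],
)
conn.commit()

crawled = []
for url in iter_links(conn):
    crawled.append(url)  # placeholder for the real crawl-and-store step
```

With only 700 rows the memory saving is negligible, so this is more about the shape of the loop than a real optimization.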
Network latency and resources are going to be much worse than poking at a DB that is already sitting there, running, and designed to be queried efficiently and quickly.
However: once per day? Is there a good reason to spend any energy optimizing this task?