I’m creating a tool for scraping links from multiple URLs. I want to store this information, then test the scraped links for their status.
I expect to have to test a lot of links, about 60,000, so the problem I have is deciding how to store the links to test.
What I’m thinking of doing is creating text files for the URLs I’ll be scraping, about 40 files in total (the URLs I’m scraping are the same URL, just regionalised).
- Would creating lots of text files cause performance issues?
- Would I be best off storing the URLs in an array and then writing the array to the text file, or should I just write each URL to the text file as I go? Or is there a better way?
- Is there a better method than storing in text files? (I don’t really want to use a database, but if there is a good case for it I could be convinced.)
IMHO the easiest approach is to use serialization to save your information. For example, serialize a Map<String, Set<String>> of URLs.
Multiple files should work too, without any serious performance impact, but that takes slightly longer to implement.
Another approach: register on mongolab and use a free account. (It’s not advertising, I just like this service.) You don’t need to install anything, just download the mongo driver and go ahead.
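To make the serialization suggestion concrete, here is a minimal sketch using Java's built-in object serialization. The class name LinkStore, the save/load helpers, the file name links.ser and the example.com URLs are all made up for illustration. I'm also assuming the map key is the page you scraped and the value is the set of links found on it, which is just one way to lay it out.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: stores scraped links as page URL -> set of links found on that page.
class LinkStore {

    // Writes the whole map to a single file. HashMap, HashSet and String are all Serializable.
    public static void save(Map<String, Set<String>> links, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(links);
        }
    }

    // Reads the map back. The cast is unchecked because generic types are erased at runtime.
    @SuppressWarnings("unchecked")
    public static Map<String, Set<String>> load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (Map<String, Set<String>>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Set<String>> links = new HashMap<>();
        // Placeholder URLs, one entry per regionalised page.
        links.computeIfAbsent("https://example.com/uk", k -> new HashSet<>())
             .add("https://example.com/uk/about");
        links.computeIfAbsent("https://example.com/de", k -> new HashSet<>())
             .add("https://example.com/de/about");

        Path file = Paths.get("links.ser"); // assumed file name
        save(links, file);

        Map<String, Set<String>> restored = load(file);
        System.out.println(restored.equals(links));
    }
}
```

With around 60,000 short strings the whole map should fit comfortably in memory, so collecting everything first and writing once at the end is the simpler route. The trade-off is that a crash mid-scrape loses whatever has not been saved yet, so you may want to call save after each region.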