I need to build a content gathering program that will simply read numbers on specified web pages and save that data for later analysis. I don’t need it to follow links or search for related data, just gather all the data from a fixed set of websites whose content changes daily.
I have very little programming experience, and I am hoping this will be good for learning. Speed is not a huge issue, I estimate that the crawler would at most have to load 4000 pages in a day.
Thanks.
Edit: Is there any way to test ahead of time if the websites from which I am gathering data are protected against crawlers?
Python probably, or Perl.
Perl has the very nice LWP (Library for WWW in Perl), and Python has urllib2.
Both are easy scripting languages available on most OSs.
I’ve written crawlers in Perl quite a few times; it’s an evening of work.
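To make the urllib2 route concrete, here is a minimal sketch in Python 2 (in Python 3 the same call lives in urllib.request); the URL and the number-matching regex are placeholders for whatever pages and values you actually care about:

    import re
    import urllib2

    url = "http://example.com/data.html"  # placeholder: your target page

    # Fetch the raw HTML of the page.
    html = urllib2.urlopen(url).read()

    # Pull every integer or decimal number out of the raw HTML.
    numbers = re.findall(r"\d+(?:\.\d+)?", html)
    print numbers

From there, saving the results is just a matter of appending them to a file or a small database, and looping over your list of 4000 URLs.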
And no, sites can’t really protect themselves from crawlers, except by using a CAPTCHA of some sort; everything else is easier to crack than it is to set up.
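Regarding the question’s edit: one thing you can check ahead of time is the site’s robots.txt, which tells you whether the site asks crawlers to stay away. Note this is purely a politeness convention, not a technical barrier, which is why I say sites can’t really protect themselves. A minimal sketch using Python 2’s standard robotparser module (urllib.robotparser in Python 3); the example.com URLs are placeholders:

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    # True if the site's robots.txt allows any user agent to fetch this page.
    print rp.can_fetch("*", "http://example.com/data.html")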
There was a point about Java: Java is fine. It’s more verbose and requires some development environment setup, so you wouldn’t do it in one evening; more likely a week.
For a small task like the one the question author described, that might be overkill.
On the other hand, there are very useful libraries like lint, tagsoup (DOM traversal for the random HTML out there), and lucene (full-text indexing and search), so you might want Java for more serious projects. In that case, I’d recommend the Apache commons-httpclient library for web crawling (or nutch if you’re crazy :).

Also: there are shelfware products that monitor changes on specified websites and present them in useful ways, so you might just grab one of those.