I need to build a content gathering program that will simply read numbers on specified web pages and save that data for later analysis. I don’t need it to follow links or search for related data, just gather all the data from a fixed set of websites whose content changes daily.
I have very little programming experience, and I am hoping this will be good for learning. Speed is not a huge issue, I estimate that the crawler would at most have to load 4000 pages in a day.
Thanks.
Edit: Is there any way to test ahead of time if the websites from which I am gathering data are protected against crawlers?
Python probably, or Perl.
Perl has the very nice LWP (Library for WWW in Perl), and Python has urllib2.
Both are easy scripting languages available on most OSs.
I’ve written crawlers in Perl quite a few times; it’s an evening of work.
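To make the urllib2 route concrete, here is a minimal sketch in Python 2 (in Python 3 the same call lives in urllib.request); the URL and the number-matching regex are placeholders for whatever pages and values you actually care about:

    import re
    import urllib2

    url = "http://example.com/data.html"  # placeholder: your target page

    # Fetch the raw HTML of the page.
    html = urllib2.urlopen(url).read()

    # Pull every integer or decimal number out of the raw HTML.
    numbers = re.findall(r"\d+(?:\.\d+)?", html)
    print numbers

From there, saving the results is just a matter of appending them to a file or a small database, and looping over your list of 4000 URLs.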
And no, sites can’t really protect themselves from crawlers, except by using a CAPTCHA of some sort; everything else is easier to crack than it is to set up.
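Regarding the question’s edit: one thing you can check ahead of time is the site’s robots.txt, which tells you whether the site asks crawlers to stay away. Note this is purely a politeness convention, not a technical barrier, which is why I say sites can’t really protect themselves. A minimal sketch using Python 2’s standard robotparser module (urllib.robotparser in Python 3); the example.com URLs are placeholders:

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    # True if the site's robots.txt allows any user agent to fetch this page.
    print rp.can_fetch("*", "http://example.com/data.html")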
There was a point about Java: Java is fine. It’s more verbose and requires some development environment setup, so you wouldn’t do it in one evening; more likely a week.
For a small task like the one the question author described, that might be overkill.
On the other hand, there are very useful libraries like lint, tagsoup (DOM traversal for the random HTML out there), and lucene (full-text indexing and search), so you might want Java for more serious projects. In that case, I’d recommend the Apache commons-httpclient library for web crawling (or nutch if you’re crazy :).

Also: there are shelfware products that monitor changes on specified websites and present them in useful ways, so you might just grab one of those.