What is the easiest way to programmatically extract structured data from a bunch of web pages?
I am currently using an Adobe AIR program I have written to follow the links on one page and grab a section of data off of the subsequent pages. This actually works fine, and for programmers I think this (or other languages) provides a reasonable approach, written on a case-by-case basis. Maybe there is a specific language or library that allows a programmer to do this very quickly, and if so I would be interested in knowing what they are.
Also, do any tools exist that would allow a non-programmer, like a customer support rep or someone in charge of data acquisition, to extract structured data from web pages without a lot of copy and paste?
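For the programmer case, the follow-links-then-extract pattern described above can be sketched in Perl with the WWW::Mechanize CPAN module; the URL, link filter, and target markup below are hypothetical placeholders:

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://example.com/listing');    # hypothetical index page

# Follow each link that looks like a detail page
for my $link ( $mech->find_all_links( url_regex => qr/detail/ ) ) {
    my $page = WWW::Mechanize->new;
    $page->get( $link->url_abs );

    # Grab the section of interest, e.g. whatever sits in <div id="data">...</div>
    if ( $page->content =~ m{<div id="data">(.*?)</div>}s ) {
        print "$1\n";
    }
}
```

In practice you would swap the regex for a proper HTML parser (HTML::TreeBuilder, pQuery, etc.), but the crawl loop itself stays this small.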
If you do a search on Stack Overflow for WWW::Mechanize & pQuery you will see many examples using these Perl CPAN modules. However, because you have mentioned "non-programmer", perhaps the Web::Scraper CPAN module may be more appropriate? It's more DSL-like, and so perhaps easier for a "non-programmer" to pick up.

Here is an example from the documentation for retrieving tweets from Twitter:
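Reproduced from memory of the Web::Scraper SYNOPSIS, so details may differ from the current POD, the example looks roughly like this:

```perl
use URI;
use Web::Scraper;

# Declare what to extract: each <li class="status"> becomes one
# entry in the resulting 'tweets' array, scraped by a nested scraper.
my $tweets = scraper {
    process "li.status", "tweets[]" => scraper {
        process ".entry-content",      body => 'TEXT';
        process ".entry-date",         when => 'TEXT';
        process 'a[rel="bookmark"]',   link => '@href';
    };
};

my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );

for my $tweet ( @{ $res->{tweets} } ) {
    print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
}
```

The declarative `process "selector", key => 'TEXT'` style is what makes it feel closer to a configuration file than a program, which is why it may suit your non-programmer users better. (Note the CSS selectors above match Twitter's markup at the time that documentation was written; the page structure has long since changed.)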