Last year I dabbled in a bit of Perl programming. The first thing I wrote was a simple script that took a web page and counted how many times a word or name appeared on it. I refer to this as “crawling”; is that correct? I was wondering if this is a native process in other languages like PHP and RoR. Essentially I want to build my own “API” for a site without a public “API”, and possibly pass in the keywords dynamically from another site’s “API” (just for reading and organizing public data). Sorry for the high level of abstraction; my head has just been in the clouds lately.
Your problem is very tractable, and in fact many people and companies have done it already, but alas you are still a long way off. Loosely speaking, “crawling” usually refers to a breadth-first or depth-first search of the internet, using the anchor tags in HTML pages as the edges between nodes.
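To make that definition concrete, here is a minimal sketch of a breadth-first crawl in Python (not the PHP I originally used). It assumes you supply your own `fetch` callable that maps a URL to its HTML text, or `None` if the page is unavailable; that name, `LinkCollector`, and `max_pages` are all made up for illustration, and a real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; returns URLs in the order they were visited.

    `fetch` is a hypothetical callable: URL -> HTML string, or None.
    """
    seen = {start_url}            # every URL ever queued, to avoid loops
    queue = deque([start_url])    # FIFO queue gives breadth-first order
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)   # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

Swapping `queue.popleft()` for `queue.pop()` would turn this into a depth-first crawl. Passing `fetch` in as a parameter keeps the traversal logic separate from the networking, so you can try it against a dictionary of fake pages before pointing it at a real site.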
What you did in Perl was basically just search an HTML string.
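For comparison, a plain string search over raw HTML might look roughly like the sketch below (in Python rather than Perl; `count_word` is a name invented for this example). Note that it happily counts matches inside tags and attributes too, which is one of the problems with treating a page as a flat string.

```python
import re


def count_word(html, word):
    """Count case-insensitive whole-word matches of `word` in the raw string."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, html, flags=re.IGNORECASE))


page = '<p class="perl">Perl is fun. I like perl.</p>'
# The class attribute is counted along with the two visible occurrences.
print(count_word(page, "perl"))  # prints 3
```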
For your API I would suggest finding a DOM parser so that you don’t have to bother parsing HTML strings by hand, with the inherent errors that produces.
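As a rough illustration of letting a parser do the work, the sketch below counts a word only in the visible text of a page. It uses Python’s standard-library `html.parser`, which is event-based rather than a true DOM parser; I am using it here only so the example has no dependencies, and a real DOM library would be more convenient for pulling out specific elements. `TextExtractor` and `count_visible_word` are names made up for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_visible_word(html, word):
    """Count whole-word matches of `word` in the page's visible text only."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


page = '<p class="perl">Perl is fun. I like perl.</p><script>var perl = 1;</script>'
# Attribute values and script contents are no longer counted.
print(count_visible_word(page, "perl"))  # prints 2
```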
A few years back I wanted to generate some data on apartment prices in regions of Massachusetts, so I wrote a bit of a crawler to extract all of the apartment listings on Craigslist and toss them in a DB.
If anyone is interested I can go on, but it’s outside the scope of this answer.
Oh yeah, and it was in PHP…