I’m pretty ignorant of what appears in the HTML/JavaScript of a website because I spend most of my time on the back end (phrasing!). Basically, I want to know the best way to take a company’s URL, e.g. PETA’s, and parse descriptive words about the company out of its front-page HTML. That way you could jump-start an auto-tagging categorization website with just a list of company URLs.
If this is reasonable, any recommendations for tools or processes to find and mine the content would be very welcome.
And if it isn’t, or if you have a better idea for getting the tags, let me know as well!
Mike Swift is quite right: if you’re looking for categorization only, all you need to do is parse out the DMOZ categorizations. The Amazon service uses DMOZ to get its categories anyway, and DMOZ is free (unlike AWIS). For example, parse out this link to get the categories for PETA.
If you’re looking for parsing tools, I’ve quite enjoyed Nokogiri, but any web-parsing tool like BeautifulSoup works. I would parse it with something like:
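As a rough sketch of the front-page approach, here is one way it could look in Python using only the standard library's `html.parser` (the same idea carries over to Nokogiri or BeautifulSoup). It assumes the most descriptive words live in the `<title>` tag and the description/keywords `<meta>` tags, which will not hold for every site. The names `FrontPageText`, `candidate_tags` and `STOP_WORDS` are made up for this example, and the stop-word list is deliberately tiny.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately small stop-word list; a real one would be longer.
STOP_WORDS = {"and", "the", "for", "with", "from", "that", "this", "our", "are", "you"}


class FrontPageText(HTMLParser):
    """Collects text from <title> and from description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.chunks.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; body text is ignored in this sketch.
        if self.in_title:
            self.chunks.append(data)


def candidate_tags(html, limit=5):
    """Return the most frequent non-trivial words found in the title/meta text."""
    parser = FrontPageText()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(limit)]
```

To use it, fetch the front page yourself (for example with `urllib.request.urlopen(url).read().decode("utf-8", errors="replace")`) and pass the resulting string to `candidate_tags`. Counting raw word frequency is a crude heuristic, so expect to clean up the results by hand.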
Hope that helps!