I’m pretty ignorant of what appears in the html/javascript of a website because I


Asked: May 23, 2026 (2026-05-23T10:55:21+00:00)


I’m pretty ignorant of what appears in the HTML/JavaScript of a website because I spend most of my time on the back-end (phrasing!). Basically, I want to know the best way to take a company’s URL, e.g. PETA’s, and parse descriptive words about the company out of its front-page HTML. That way you could jump-start an auto-tagging categorization website with just a list of company URLs.

If this is reasonable, any recommendations for tools/processes to find/mine the content would be much welcomed.

And if not, or if you have a better idea for getting the tags, let it be known as well!


1 Answer

Editorial Team · Answer 1 · 2026-05-23T10:55:21+00:00

Mike Swift is quite right: if you’re looking for categorization only, then all you need to do is parse out the DMOZ categorizations. The Amazon service uses DMOZ to get its categories anyway, and DMOZ is free (unlike AWIS). For example, parse out this link to get the categories for PETA.

If you’re looking for parsing tools, I’ve quite enjoyed Nokogiri, but any web-parsing tool like BeautifulSoup works. I would parse it with something like:

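require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('<site>'))  # fetch and parse the category page
doc.css('ol.dir li a').map { |item| item.content }  # text of each category link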
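
Once you have the category strings, you still need to turn them into tags. Here is a minimal sketch of that step, assuming each string is a path-like category such as "Society: Issues: Animal Welfare". The separator, the sample data, and the helper name categories_to_tags are my own assumptions for illustration, so check them against what the page actually returns:

```ruby
# Assumed separators between category levels (":" or "/"); adjust as needed.
SEPARATOR = /\s*[:\/]\s*/

# Hypothetical helper: flatten category paths into a de-duplicated tag list.
def categories_to_tags(categories)
  categories
    .flat_map { |category| category.split(SEPARATOR) }     # one entry per level
    .map { |part| part.tr('_', ' ').strip.downcase }        # normalize spelling
    .reject(&:empty?)                                       # drop blank pieces
    .uniq                                                   # keep first occurrence
end

# Made-up sample input, standing in for the link texts extracted above.
sample = [
  'Society: Issues: Animal Welfare',
  'Society: Issues: Animal Welfare: Organizations'
]

puts categories_to_tags(sample).inspect
```

Splitting every level into its own tag gives you broad tags ("society") as well as specific ones ("animal welfare"); if the broad ones are too noisy, you could keep only the last one or two levels of each path.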

Hope that helps!
