So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/ and create

Question


Asked: May 26, 2026 (2026-05-26T16:46:02+00:00)


So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/
and create one HTML page that I can either print or send to my Kindle.

I am thinking of using Hpricot, but am not too sure how to proceed.

How do I set it up so it recursively checks each link, gets the HTML, either stores it in a variable or dumps it to the main HTML page and then goes back to the table of contents and keeps doing that?

You don’t have to tell me EXACTLY how to do it, but just the theory behind how I might want to approach it.

Do I literally have to look at the source of one of the articles (which is EXTREMELY ugly btw), e.g. view-source:http://boxerbiography.blogspot.com/2006/12/10-progamer-lim-yohwan-e-sports-icon.html and manually programme the script to extract text between certain tags (e.g. h3, p, etc.)?

If I take that approach, then I will have to look at the source of each individual chapter/article and do the same thing. Kinda defeats the purpose of writing a script to do it, no?

Ideally I would like a script that can tell JS and other code apart from the actual ‘text’ and dump just the text (formatted with the proper headings and such).

Would really appreciate some guidance.

Thanks.


1 Answer

Editorial Team · Answer 1 · 2026-05-26T16:46:02+00:00

I’d recommend using Nokogiri instead of Hpricot. It’s more robust, uses fewer resources, has fewer bugs, is easier to use, and is faster.

I did extensive scraping for work at one point and had to switch to Nokogiri because Hpricot would inexplicably crash on some pages.

Check this RailsCast:

http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

and:

http://nokogiri.org/

http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html

http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/
