I wonder how it is possible to (more or less) reliably clip the content from a random web site (using Ruby or JavaScript, doesn’t really matter).
Much like Evernote and Flipboard do.
What is the best way to determine where the actual content is within a page?
The purpose: given a URL – retrieve the actual content of that page and ignore all the layout and other unrelated information.
For example:
- given http://ninemsn.com/ => the HTML of the main news topic that is in the middle part of the content.
- given http://news.cnet.com/8301-1035_3-20104048-94/a-beginners-guide-to-telecom-jargon-part-7 => the HTML of the main article.
Just use Evernote’s “clip full page” option to see exactly what I mean.
Thanks.
My initial thought would be to parse the page into a DOM, then traverse the DOM tree to the content of a specific div and show that (via XPath, etc.). For pages without clearly defined sections it’s going to be difficult regardless of which method you use. The AutoPager plugin for Firefox and Chrome implements XPath parsing behaviour. Get the latest version and open up the .xpi to see how the author does it. It’s a JavaScript implementation.

Pick the div by letting someone enter, per URL/site scheme, what the id or class of the content div is. For your ninemsn example, one div contains the article’s title, share buttons, the author’s image, and the post content, and another div holds the actual body of the text.
So someone would enter that you need to parse the first h1 from <div class="post"> and that’s the article title, and then get all the text from <div class="postBody"> and make that the article content (you might need to parse the class in such a way that it can match both postBody and txtWrap).

Another example (for funsies): Stack Overflow. A question’s title is straightforward to locate. A question’s text is trickier, because it’s in a div with the same class as an answer’s text, and no id. You need to match <div id="question"> and then traverse down to the text div inside it. Similarly for answers, each <div id="answer-[UINTEGER]"> contains a <div class="post-text"> with its respective text.

In both situations, you can traverse those top-level question and answer- divs for <div class="user-details"> to fetch usernames, reputation and badge counts, etc.
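
To make the per-site rule idea concrete, here is a minimal sketch in Python using only the standard library (the same approach carries over to Ruby or JavaScript with a real DOM/XPath library). The class names post, postBody and txtWrap come from the ninemsn example above; everything else (SITE_RULES, PathTextExtractor, extract_text, clip, and the sample markup) is hypothetical and only there for illustration. It assumes reasonably well-nested HTML and takes the page source as a string rather than fetching it.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

# Hypothetical per-site rules: each rule is a path of (tag, acceptable classes).
# A class set of None means "any element with this tag".
SITE_RULES = {
    "ninemsn.com": {
        "title": [("div", {"post"}), ("h1", None)],
        "body": [("div", {"postBody", "txtWrap"})],
    },
}


class PathTextExtractor(HTMLParser):
    """Collect the text of the first element reached by following `steps`."""

    def __init__(self, steps):
        super().__init__()
        self.steps = steps
        self.matched = []   # nesting depth at which each step was matched
        self.depth = 0
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.depth += 1
        if self.done or len(self.matched) == len(self.steps):
            return
        want_tag, want_classes = self.steps[len(self.matched)]
        have = set((dict(attrs).get("class") or "").split())
        if tag == want_tag and (not want_classes or have & want_classes):
            self.matched.append(self.depth)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.matched and self.matched[-1] == self.depth:
            if len(self.matched) == len(self.steps):
                self.done = True   # the target element just closed
            self.matched.pop()
        self.depth -= 1

    def handle_data(self, data):
        if not self.done and len(self.matched) == len(self.steps):
            self.parts.append(data)


def extract_text(html, steps):
    parser = PathTextExtractor(steps)
    parser.feed(html)
    # Join the fragments and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


def clip(url, html):
    """Return {"title": ..., "body": ...}, or None if no rule exists for the host."""
    rules = SITE_RULES.get(urlparse(url).hostname)
    if rules is None:
        return None
    return {name: extract_text(html, steps) for name, steps in rules.items()}


# Made-up markup that mimics the structure described above.
SAMPLE_HTML = """
<div class="header"><h1>Site name</h1></div>
<div class="post"><h1>Article title</h1><img src="author.png">
  <div class="postBody txtWrap">
    <p>First paragraph.</p>
    <p>Second<br>paragraph.</p>
  </div>
</div>
<div class="footer">Footer links</div>
"""

if __name__ == "__main__":
    print(clip("http://ninemsn.com/", SAMPLE_HTML))
```

Matching on a set of classes rather than the raw class attribute is what lets one rule cover both postBody and txtWrap. Real pages with unclosed tags would need a more forgiving parser than this depth counter.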