I wonder how it is possible to (more or less) reliably clip the

Question


Asked: May 25, 2026 (2026-05-25T12:10:08+00:00)


I wonder how it is possible to (more or less) reliably clip the content from an arbitrary web site (using Ruby or JavaScript; it doesn’t really matter which).

Much like Evernote and Flipboard do.

What is the best way to determine where the actual content is within a page?

The purpose: given a URL, retrieve the actual content of that page and ignore the layout and other unrelated information.

For example:

given http://ninemsn.com/ => the HTML of the main news topic in the middle part of the page.
given http://news.cnet.com/8301-1035_3-20104048-94/a-beginners-guide-to-telecom-jargon-part-7 => the HTML of the main article.

Just use Evernote’s “clip full page” option to see exactly what I mean.

Thanks.


1 Answer

Editorial Team · Answer 1 · 2026-05-25T12:10:08+00:00

My initial thought would be to parse the page into a DOM, then traverse the tree to a specific div and show its content (via XPath, etc.). For pages without clearly defined sections, it’s going to be difficult regardless of which method you use. The AutoPager plugin for Firefox and Chrome implements this kind of XPath matching. Get the latest version and open up the .xpi to see how the author does it. It’s a JavaScript implementation.
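As a rough illustration of the XPath approach, here is a minimal sketch built on the standard DOM `evaluate` method. The helper names `firstNodeByXPath` and `textByXPath` are made up for this example, and it assumes the page has already been parsed into a document object (in a browser, `document`). It is not how AutoPager itself is written.

```javascript
// Numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE. It is spelled out so
// the helpers can also be exercised with a stand-in document outside a browser.
const FIRST_ORDERED_NODE_TYPE = 9;

// Returns the first node matching `xpath` under `doc`, or null if none matches.
function firstNodeByXPath(doc, xpath) {
  const result = doc.evaluate(xpath, doc, null, FIRST_ORDERED_NODE_TYPE, null);
  return result.singleNodeValue;
}

// Returns the trimmed text of the first matching node, or null if none matches.
function textByXPath(doc, xpath) {
  const node = firstNodeByXPath(doc, xpath);
  return node ? node.textContent.trim() : null;
}
```

In a browser console, something like `textByXPath(document, '//div[@id="question-header"]')` would return the text of that div if the page has one.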

Pick the div by letting someone enter, per URL/site scheme, what the id or class of the content div is. For your ninemsn example, the div containing the article’s title, share buttons, the author’s image, and the post content is

<div class="post">

and the actual body of the text is

<div class="postBody txtWrap" section="txt">

So someone would enter a rule saying that the first h1 inside <div class="post"> is the article title, and that all the text inside <div class="postBody"> is the article content. Because the class attribute holds several space-separated names, match on the individual class name rather than on the whole attribute string, so that the rule works whether you look for postBody or txtWrap.
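A per-site rule table along these lines might look like the sketch below. The rule shape, the hostname key, and the names `SITE_RULES`, `ruleForUrl`, `extractArticle`, and `xpathForClass` are all assumptions made for illustration; only the class names come from the markup quoted above. `extractArticle` expects an already-parsed document or element that supports `querySelector`.

```javascript
// Hypothetical rule table: one entry per site, keyed by hostname.
const SITE_RULES = new Map([
  ["ninemsn.com", { title: "div.post h1", body: "div.postBody" }],
]);

// Looks up the rule for a URL, ignoring a leading "www."; null if none is known.
function ruleForUrl(url, rules = SITE_RULES) {
  const host = new URL(url).hostname.replace(/^www\./, "");
  return rules.get(host) || null;
}

// Applies a rule to a parsed document (or any element) and returns the
// article title as text and the article body as HTML.
function extractArticle(root, rule) {
  const titleNode = root.querySelector(rule.title);
  const bodyNode = root.querySelector(rule.body);
  return {
    title: titleNode ? titleNode.textContent.trim() : null,
    bodyHtml: bodyNode ? bodyNode.innerHTML : null,
  };
}

// A CSS selector such as "div.postBody" already matches class="postBody txtWrap".
// XPath 1.0 has no class selector, so the usual workaround pads the attribute
// with spaces and looks for the padded class name.
function xpathForClass(tag, className) {
  return `//${tag}[contains(concat(' ', normalize-space(@class), ' '), ' ${className} ')]`;
}
```

Keeping the rules as plain data means a new site can be supported by adding an entry rather than writing code, which is the point of letting someone enter the id or class per site.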

Another example (for funsies): Stack Overflow. A question’s title is contained in

<div id="question-header">

A question’s text is trickier, because it’s in a div with the same class as an answer’s text, and no id. You need to match <div id="question"> and then traverse down to

<div class="post-text">

Similarly for answers, each <div id="answer-[UINTEGER]"> contains a <div class="post-text"> with its respective text.

In both cases, you can search the top-level question and answer-[UINTEGER] divs for <div class="user-details"> to fetch usernames, reputation, badge counts, etc.
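The Stack Overflow traversal described above could be sketched as follows. The function name `extractQuestionPage` and the shape of the returned object are invented for this example, and the ids and classes are taken from the description above, so they may not match the site's current markup.

```javascript
// Collects the title, question text, and answers from a parsed question page.
// `root` is a document or element that supports querySelector/querySelectorAll.
function extractQuestionPage(root) {
  const text = (node) => (node ? node.textContent.trim() : null);

  // Note: [id^="answer-"] matches any id starting with "answer-", which is
  // looser than "answer-" followed by an unsigned integer.
  const answerDivs = Array.from(root.querySelectorAll('div[id^="answer-"]'));

  return {
    title: text(root.querySelector("#question-header")),
    // Scope the lookup to #question, because answers reuse the post-text class.
    question: text(root.querySelector("#question .post-text")),
    answers: answerDivs.map((answer) => ({
      id: answer.id,
      text: text(answer.querySelector(".post-text")),
      user: text(answer.querySelector(".user-details")),
    })),
  };
}
```

In a browser this would be called as `extractQuestionPage(document)`; on a server the same function should work with any parser that exposes the standard selector methods.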
