The Archive Base

Editorial Team
Asked: May 25, 2026


I wonder how to (more or less) reliably clip the content from an arbitrary web site (using Ruby or JavaScript; the language doesn’t really matter).

Much like Evernote and Flipboard do.

What is the best way to determine where the actual content is within a page?

The purpose: given a URL – retrieve the actual content of that page and ignore all the layout and other unrelated information.

For example:

  • given http://ninemsn.com/ => the HTML of the main news topic in the middle of the page.
  • given http://news.cnet.com/8301-1035_3-20104048-94/a-beginners-guide-to-telecom-jargon-part-7 => the HTML of the main article.

Just use Evernote’s “clip full page” option to see exactly what I mean.

Thanks.


1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 12:10 pm

    My first thought would be to parse the page into a DOM, then traverse the tree to a specific content div and return that (via XPath or similar). For pages without clearly defined sections this will be difficult whichever method you use. The AutoPager plugin for Firefox and Chrome implements XPath parsing behaviour: get the latest version and open up the .xpi to see how it’s done. It’s a JavaScript implementation.
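
The parse-then-traverse idea above can be sketched with Python's standard library (Python is used here only because it ships a parser with basic XPath support; the same approach carries over to Ruby or JavaScript). This is a minimal sketch, not how AutoPager does it: the `PAGE` markup, the `content` id and the `clip` helper are all made up for illustration, and `xml.etree.ElementTree` only accepts well-formed markup, so a real page would first need a lenient HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page. Real-world HTML is rarely this tidy and
# would need a lenient HTML parser before this step.
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<div id="content"><h1>Title</h1><p>First paragraph.</p></div>
<div id="footer">Copyright</div>
</body></html>"""


def clip(page, xpath):
    """Return the text under the first element matching xpath, or None."""
    root = ET.fromstring(page)
    node = root.find(xpath)  # ElementTree supports a limited XPath subset
    if node is None:
        return None
    # Join the text fragments of the node and all of its descendants.
    return " ".join(t.strip() for t in node.itertext() if t.strip())


clipped = clip(PAGE, ".//div[@id='content']")
print(clipped)
```

Everything outside the matched div (navigation, footer) is ignored, which is the whole point of the approach; the hard part remains choosing the XPath for each site.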

    To pick the div, let someone specify, per URL or site scheme, the id or class of the content div. For your ninemsn example, the div containing the article’s title, share buttons, the author’s image, and the post content is

    <div class="post">
    

    and the actual body of the text is

    <div class="postBody txtWrap" section="txt">
    

    So the rule someone enters would say: take the first h1 inside <div class="post"> as the article title, then take all the text inside <div class="postBody"> as the article content. Because the class attribute is a space-separated list, match on individual class names rather than on the exact string, so that the rule still matches an element carrying both postBody and txtWrap.
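
One way such per-site rules might look is sketched below, using only Python's standard `html.parser`. The host name `news.example.com`, the `SITE_RULES` layout, the `PathTextExtractor` class and the sample markup are assumptions made for illustration; only the class names `post`, `postBody` and `txtWrap` come from the answer above. The sketch also assumes every non-void tag in the page is explicitly closed.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag

# Hypothetical rules: each field maps to a path of (tag, class name or None).
SITE_RULES = {
    "news.example.com": {
        "title": [("div", "post"), ("h1", None)],
        "body": [("div", "postBody")],
    },
}


class PathTextExtractor(HTMLParser):
    """Collect the text of the first element reached by following `path`."""

    def __init__(self, path):
        super().__init__()
        self.path = path
        self.stack = []    # one flag per open tag: did it advance the path?
        self.matched = 0   # number of path steps currently matched
        self.done = False  # set once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        advanced = False
        if not self.done and self.matched < len(self.path):
            want_tag, want_class = self.path[self.matched]
            # Compare against individual class names, not the raw attribute.
            classes = (dict(attrs).get("class") or "").split()
            if tag == want_tag and (want_class is None or want_class in classes):
                self.matched += 1
                advanced = True
        self.stack.append(advanced)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        if self.stack.pop():
            if self.matched == len(self.path):
                self.done = True
            self.matched -= 1

    def handle_data(self, data):
        if not self.done and self.matched == len(self.path):
            self.parts.append(data)


def clip(url, html):
    """Apply the rules registered for the URL's host to the page markup."""
    result = {}
    for field, path in SITE_RULES[urlparse(url).hostname].items():
        extractor = PathTextExtractor(path)
        extractor.feed(html)
        extractor.close()
        result[field] = " ".join(p.strip() for p in extractor.parts if p.strip())
    return result


SAMPLE = """<div class="header"><h1>Site name</h1></div>
<div class="post"><h1>Article title</h1><div class="share">Share</div>
<div class="postBody txtWrap" section="txt"><p>First.</p><p>Second.</p></div></div>"""

clipped = clip("http://news.example.com/some-article", SAMPLE)
print(clipped)
```

Splitting the class attribute on whitespace is what lets the `postBody` rule match an element whose class is `postBody txtWrap`. Supporting a new site then means adding an entry to the rules table rather than writing new parsing code.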

    Another example (for funsies): Stack Overflow. A question’s title is contained in

    <div id="question-header">
    

    A question’s text is trickier, because it’s in a div with the same class as an answer’s text, and no id. You need to match <div id="question"> and then traverse down to

    <div class="post-text">
    

    Similarly for answers, each <div id="answer-[UINTEGER]"> contains a <div class="post-text"> with its respective text.

    In both cases, you can search within those top-level question and answer-[UINTEGER] divs for <div class="user-details"> to fetch usernames, reputation and badge counts, etc.
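
The scoped lookup described for Stack Overflow could be sketched as follows, again with Python's standard library. The `PAGE` markup is a hypothetical, well-formed stand-in (a real page would need a lenient HTML parser first), and the helper names `text_of`, `post` and `parse_page` are made up for illustration; only the ids and classes are taken from the answer above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a question page.
PAGE = """<html><body>
<div id="question-header"><h1>How do I clip content?</h1></div>
<div id="question">
  <div class="post-text">Question body.</div>
  <div class="user-details">alice</div>
</div>
<div id="answer-101">
  <div class="post-text">First answer.</div>
  <div class="user-details">bob</div>
</div>
<div id="answer-102">
  <div class="post-text">Second answer.</div>
  <div class="user-details">carol</div>
</div>
</body></html>"""


def text_of(node):
    """Whitespace-normalised text of a node, or None if the node is missing."""
    if node is None:
        return None
    return " ".join("".join(node.itertext()).split())


def post(container):
    """Look up text and user details inside one question or answer div."""
    # Note: [@class='...'] compares the whole attribute string exactly.
    return {
        "text": text_of(container.find(".//div[@class='post-text']")),
        "user": text_of(container.find(".//div[@class='user-details']")),
    }


def parse_page(page):
    root = ET.fromstring(page)
    return {
        "title": text_of(root.find(".//div[@id='question-header']")),
        "question": post(root.find(".//div[@id='question']")),
        # ElementTree's XPath subset has no starts-with(), so filter ids in Python.
        "answers": [
            post(div)
            for div in root.iter("div")
            if re.fullmatch(r"answer-\d+", div.get("id", ""))
        ],
    }


parsed = parse_page(PAGE)
print(parsed["title"], len(parsed["answers"]))
```

Searching from the container rather than from the document root is what keeps the question's post-text separate from the answers', even though they share a class.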


© 2021 The Archive Base. All Rights Reserved