The Archive Base

Editorial Team
Asked: May 13, 2026 (2026-05-13T08:02:48+00:00)

There’s a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser


There’s a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I’m not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice?

Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I’m looking for.

Postscript the first: To be precise, the kind of survey I’m after would be a paper (published, unpublished, whatever) that discusses both criteria from the scholarly literature, and a number of existing implementations, and analyses how unsuccessful the implementations are from the viewpoint of the criteria. And, really, a post to a mailing list would work for me too.

Postscript the second: To be clear, after Peter Rowell’s answer, which I have accepted, we can see that this question leads to two subquestions: (i) the solved problem of cleaning up non-conformant HTML, for which Beautiful Soup is the most recommended solution, and (ii) the unsolved problem of separating cruft (mostly site-added boilerplate and promotional material) from meat (the content that the kind of people who think the page might be interesting in fact find relevant). To address the state of the art, new answers need to address the cruft-from-meat problem explicitly.


1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 8:02 am (2026-05-13T08:02:48+00:00)

    Extraction can mean different things to different people. It’s one thing to be able to deal with all of the mangled HTML out there, and Beautiful Soup is a clear winner in this department. But BS won’t tell you what is cruft and what is meat.
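To make that distinction concrete, here is a minimal sketch of the "parse whatever you are given" step. It uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup, not Beautiful Soup itself, and the `TextCollector` class, the `visible_text` helper, and the `MANGLED` sample are all made up for illustration. The markup has unquoted attributes and unclosed tags, yet the text still comes out; what the parser cannot do is say which of the strings are navigation and which are the article.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text chunks of a (possibly malformed) HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return parser.chunks


# Hypothetical sample: unquoted attributes, unclosed <li>, <a>, <p> and <div>.
MANGLED = (
    "<div><ul><li><a href=/home>Home<li><a href=/about>About</ul>"
    "<p>Actual article text<p>Second paragraph"
)

print(visible_text(MANGLED))
```

The navigation labels and the article paragraphs land in the same flat list, which is exactly the cruft-versus-meat problem the rest of this answer is about.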

    Things look different (and ugly) when considering content extraction from the point of view of a computational linguist. When analyzing a page I’m interested only in the specific content of the page, minus all of the navigation/advertising/etc. cruft. And you can’t begin to do the interesting stuff (co-occurrence analysis, phrase discovery, weighted attribute vector generation, etc.) until you have gotten rid of the cruft.

    The first paper referenced by the OP indicates that this was what they were trying to achieve (analyze a site, determine the overall structure, then subtract that out and Voila! you have just the meat), but they found it was harder than they thought. They were approaching the problem from an improved accessibility angle, whereas I was an early search engine guy, but we both came to the same conclusion:

    Separating cruft from meat is hard. And (to read between the lines of your question) even once the cruft is removed, without carefully applied semantic markup it is extremely difficult to determine ‘author intent’ of the article. Getting the meat out of a site like citeseer (cleanly & predictably laid out with a very high Signal-to-Noise Ratio) is 2 or 3 orders of magnitude easier than dealing with random web content.
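The "determine the overall structure, then subtract it out" idea can be sketched in a few lines. This is not the method from the Gupta & Kaiser paper, only a naive illustration: it assumes each page has already been split into text blocks, and the function name `subtract_site_template` and the `min_share` threshold are hypothetical. Any block that shows up on most pages of the same site is treated as template and dropped.

```python
from collections import Counter


def subtract_site_template(pages, min_share=0.6):
    """Drop text blocks that recur across most pages of one site.

    pages: dict mapping a page id to its list of text blocks.
    min_share: fraction of pages a block must appear on to count as template
               (an arbitrary illustrative default).
    """
    doc_freq = Counter()
    for blocks in pages.values():
        doc_freq.update(set(blocks))  # count each block once per page

    threshold = min_share * len(pages)
    return {
        page_id: [b for b in blocks if doc_freq[b] < threshold]
        for page_id, blocks in pages.items()
    }


# Hypothetical three-page site sharing a nav bar and a footer.
site = {
    "a": ["Home", "About", "Story A", "(c) Example"],
    "b": ["Home", "About", "Story B", "(c) Example"],
    "c": ["Home", "About", "Story C", "(c) Example"],
}
print(subtract_site_template(site))
```

On a tidy site this works. On random web content it falls over quickly, because real boilerplate carries dates, counters, rotating ads, and "related" lists that never match exactly from page to page, which is one reason the problem is harder than it first looks.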

    BTW, if you’re dealing with longer documents you might be particularly interested in work done by Marti Hearst (now a prof at UC Berkeley). Her PhD thesis and other papers on doing subtopic discovery in large documents gave me a lot of insight into doing something similar in smaller documents (which, surprisingly, can be more difficult to deal with). But you can only do this after you get rid of the cruft.


    For the few who might be interested, here’s some backstory (probably Off Topic, but I’m in that kind of mood tonight):

    In the 80’s and 90’s our customers were mostly government agencies whose eyes were bigger than their budgets and whose dreams made Disneyland look drab. They were collecting everything they could get their hands on and then went looking for a silver bullet technology that would somehow ( giant hand wave ) extract the ‘meaning’ of the document. Right. They found us because we were this weird little company doing “content similarity searching” in 1986. We gave them a couple of demos (real, not faked) which freaked them out.

    One of the things we already knew (and it took a long time for them to believe us) was that every collection is different and needs its own special scanner to deal with those differences. For example, if all you’re doing is munching straight newspaper stories, life is pretty easy. The headline mostly tells you something interesting, and the story is written in pyramid style – the first paragraph or two has the meat of who/what/where/when, and then following paras expand on that. Like I said, this is the easy stuff.

    How about magazine articles? Oh God, don’t get me started! The titles are almost always meaningless and the structure varies from one mag to the next, and even from one section of a mag to the next. Pick up a copy of Wired and a copy of Atlantic Monthly. Look at a major article and try to figure out a meaningful 1 paragraph summary of what the article is about. Now try to describe how a program would accomplish the same thing. Does the same set of rules apply across all articles? Even articles from the same magazine? No, they don’t.
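For the newspaper case, the whole "scanner" can be about this small. The sketch below assumes the cruft is already gone and that paragraphs are separated by blank lines; `lead_summary` and its parameters are hypothetical names for illustration, not anything we shipped.

```python
def lead_summary(headline, body, n_paragraphs=2):
    """Pyramid-style heuristic: the headline plus the first paragraph or two."""
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    return {"headline": headline.strip(), "lead": paragraphs[:n_paragraphs]}


# Hypothetical story text.
story = (
    "The council voted 5-2 on Tuesday to close the Main Street bridge.\n\n"
    "Repairs are expected to take six months.\n\n"
    "The bridge was built in 1931 and last renovated in 1978."
)
print(lead_summary("Council closes Main Street bridge", story))
```

Run the same function on a magazine feature and you typically get an opening anecdote and a title that tells you nothing, which is the point: the rule belongs to the collection, not to text in general.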

    Sorry to sound like a curmudgeon on this, but this problem is genuinely hard.

    Strangely enough, a big reason for Google being as successful as it is (from a search engine perspective) is that they place a lot of weight on the words in and surrounding a link from another site. That link-text represents a sort of mini-summary done by a human of the site/page it’s linking to, exactly what you want when you are searching. And it works across nearly all genre/layout styles of information. It’s a positively brilliant insight and I wish I had had it myself. But it wouldn’t have done my customers any good because there were no links from last night’s Moscow TV listings to some random teletype message they had captured, or to some badly OCR’d version of an Egyptian newspaper.
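The link-text idea is easy to sketch, though this is only an illustration of the principle and not how any real search engine implements it. The code below assumes you already have the HTML of the linking pages in hand; `AnchorCollector` and `anchor_text_index` are hypothetical names, and only the text inside the link is gathered, not the surrounding words.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (absolute target URL, anchor text) pairs from one page."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self._href = None  # target of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            if text:
                self.links.append((self._href, text))
            self._href = None


def anchor_text_index(pages):
    """pages: dict of source URL -> HTML. Returns target URL -> anchor texts."""
    index = defaultdict(list)
    for url, html in pages.items():
        collector = AnchorCollector(url)
        collector.feed(html)
        collector.close()
        for target, text in collector.links:
            index[target].append(text)
    return dict(index)


# Hypothetical linking pages.
linking_pages = {
    "http://a.example/x": '<p>See <a href="http://c.example/paper">a survey of content extraction</a>.</p>',
    "http://b.example/y": '<a href="http://c.example/paper">boilerplate   removal <b>methods</b></a>',
}
print(anchor_text_index(linking_pages))
```

Each target page ends up with a handful of short, human-written descriptions, and none of it required understanding the layout of the target page itself.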

    /mini-rant-and-trip-down-memory-lane

© 2021 The Archive Base. All Rights Reserved