I’m working on a school project in which we would like to analyze the content of webpages. We don’t, however, want to deal with things like nav bars and comments. If we were looking at a specific website, we could write a parser that filters out that sort of extraneous stuff for that site, but we are hoping to work on arbitrary sites that we may never have encountered before.
I feel like it’s a bit much to hope for, so I won’t be surprised if nothing like this exists already, but does anyone know of a tool that can do that sort of content isolation on arbitrary websites? I’ve had a bit of luck diffing pages with others from the same site, but it’s imperfect and leaves comments and such.
I am working in Java, but would welcome anything open source in any language that I can use for ideas.
You could try an unofficial API of arc90’s Readability.
Basically, what Readability does is extract the main content of a webpage and present it to you as a nicely formatted article. Nav bars, comments, and all the other stuff that surrounds the content on a webpage are gone.
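If you want something to experiment with in Java in the meantime, here is a rough sketch of the general idea behind this kind of tool: split the page into blocks, then keep only blocks that have a fair amount of text and a low share of link text. This is not Readability's actual algorithm, just a simple heuristic in the same spirit. The class name `ContentSketch`, the method names, and the thresholds (25 characters, 0.3 link density) are all my own assumptions, and the regex-based tag handling is crude; for real pages you would want a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: keeps text-heavy, link-light blocks of an HTML page.
class ContentSketch {
    // Block-level tags used as split points (an illustrative, incomplete list).
    private static final Pattern BLOCK_SPLIT = Pattern.compile(
        "(?i)</?(?:p|div|li|ul|ol|nav|td|tr|table|section|article|header|footer|h[1-6]|br)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile("(?is)<(script|style)\\b.*?</\\1>");

    // Removes all tags and collapses whitespace.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Fraction of a block's visible text that sits inside <a> elements.
    static double linkDensity(String blockHtml) {
        String text = stripTags(blockHtml);
        if (text.isEmpty()) {
            return 1.0; // treat empty blocks as pure navigation
        }
        int linkChars = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            linkChars += stripTags(m.group(1)).length();
        }
        return (double) linkChars / text.length();
    }

    // Returns the text of blocks that are long enough and not dominated by links.
    static List<String> extract(String html, int minChars, double maxLinkDensity) {
        String cleaned = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_SPLIT.split(cleaned)) {
            String text = stripTags(block);
            if (text.length() >= minChars && linkDensity(block) <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<nav><a href='/'>Home</a> <a href='/about'>About us and more links here</a></nav>"
            + "<p>This is the main article paragraph with enough text to count as content.</p>"
            + "<div>Short</div>";
        for (String block : extract(html, 25, 0.3)) {
            System.out.println(block);
        }
    }
}
```

On the sample page in `main`, the nav block is long enough but consists almost entirely of link text, so it is dropped, and the short `div` fails the length check. Note that this will not remove comments on its own, since comments are usually plain text with few links; that is the part where a site-independent heuristic gets hard.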