I’m working on a school project in which we would like to analyze the

Question


Asked: May 18, 2026 (2026-05-18T05:19:43+00:00)


I’m working on a school project in which we would like to analyze the content of webpages. We don’t, however, want to deal with things like nav bars and comments. If we were looking at a specific website, we could write a parser that filters out that sort of extraneous material for that site specifically, but we are hoping to work on arbitrary sites that we may never have encountered before.

I feel like it’s a bit much to hope for, so I won’t be surprised if nothing like this exists already, but does anyone know of a tool that can do that sort of content isolation on arbitrary websites? I’ve had a bit of luck diffing pages with others from the same site, but it’s imperfect and leaves comments and such.

I am working in Java, but would welcome anything open source in any language that I can use for ideas.


1 Answer

Editorial Team · Answer 1 · 2026-05-18T05:19:44+00:00


You could try an unofficial API of arc90’s Readability.

Basically, Readability extracts the main content of a webpage and presents it to you as a nicely formatted article. Nav bars, comments, and all the other stuff that surrounds the content on a webpage are stripped out.
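
Since you are working in Java, here is a rough sketch of the general kind of heuristic that content extractors in this family tend to rely on: keep blocks with a reasonable amount of text and a low proportion of link text, and drop the rest. This is not Readability's actual algorithm or API. The class name `ContentSketch`, the method `extractBlocks`, the list of block tags, and both thresholds are assumptions made up for illustration, and regex-based HTML handling like this is crude (a real HTML parser would be more robust, and entities are not decoded).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: text-length plus link-density filtering of HTML blocks.
final class ContentSketch {
    // Tags treated as block boundaries (an arbitrary choice for this sketch).
    private static final Pattern BLOCK_BOUNDARY = Pattern.compile(
        "(?i)</?(?:div|p|li|ul|ol|td|tr|table|nav|section|article|header|footer|aside|h[1-6]|br)\\b[^>]*>");
    // Script and style elements, removed together with their contents.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
        "(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Anchor elements; group 1 is the link's inner markup.
    private static final Pattern LINK = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]*>");

    // Replaces tags with spaces and collapses runs of whitespace.
    static String stripTags(String html) {
        return ANY_TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Returns the text of blocks that have at least minChars characters and
    // whose share of link text is at most maxLinkDensity.
    static List<String> extractBlocks(String html, int minChars, double maxLinkDensity) {
        String cleaned = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String chunk : BLOCK_BOUNDARY.split(cleaned)) {
            String text = stripTags(chunk);
            if (text.length() < minChars) {
                continue; // too short to be article content
            }
            int linkChars = 0;
            Matcher m = LINK.matcher(chunk);
            while (m.find()) {
                linkChars += stripTags(m.group(1)).length();
            }
            double density = (double) linkChars / text.length();
            if (density <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
            + "<nav><a href='/'>Home</a> <a href='/about'>About us</a></nav>"
            + "<div><p>This is the main article text, long enough to pass the length threshold.</p></div>"
            + "<div><a href='/terms'>Terms of service and privacy policy information</a></div>"
            + "</body></html>";
        // Thresholds of 25 characters and 0.3 link density are guesses, not tuned values.
        for (String block : extractBlocks(html, 25, 0.3)) {
            System.out.println(block);
        }
    }
}
```

On the sample page in `main`, the nav bar is dropped for being short and almost entirely link text, the footer link is dropped for its link density, and only the article paragraph is kept. Comment sections are harder, since they are often long and link-free, so a real tool needs additional signals beyond these two.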
