
Question


Asked: May 12, 2026, 13:45:19 UTC

I’m looking for an algorithm (or some other technique) to read the actual content


I’m looking for an algorithm (or some other technique) to read the actual content of news articles on websites and ignore everything else on the page. In a nutshell, I’m reading an RSS feed programmatically from Google News, and I want to scrape the content of the underlying articles. On my first attempt I took the URLs from the RSS feed, followed them, and scraped the HTML from each page. This clearly produced a lot of “noise”: HTML tags, headers, navigation, and so on. Basically, all the information that is unrelated to the actual content of the article.

Now, I understand this is an extremely difficult problem to solve; in theory it would involve writing a parser for every website out there. What I’m interested in is an algorithm (I’d even settle for an idea) for maximizing the actual content I get when I download the article and minimizing the amount of noise.

A couple of additional notes:

Scraping the HTML is simply the first approach I tried. I’m not sold that this is the best way to do things.
I don’t want to write a parser for every website I come across. I need to be able to handle whatever Google provides through the RSS feed, however unpredictable.
I know whatever algorithm I end up with is not going to be perfect, but I’m interested in a best possible solution.

Any ideas?


1 Answer

Editorial Team · Answer 1 · 2026-05-12T13:45:20+00:00


When reading news outside of my RSS reader, I often use Readability to filter out everything but the meat of the article. It is JavaScript-based and runs in the browser, so the tool itself would not apply directly to your problem, but the algorithm behind it has a high success rate in my experience and is worth a look. Hope this helps.
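To give a feel for the kind of heuristic involved, here is a minimal Python sketch. It is not Readability's actual code, only a simplified scoring loosely inspired by that style of extractor: each paragraph earns points for length and commas, link-heavy paragraphs are discounted or dropped, and the container that collects the most points is treated as the article. The names `ParagraphScorer`, `extract_article`, and `SAMPLE`, the tag lists, and the thresholds (25 characters, 0.5 link density) are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed tag lists; tune them for the sites you actually see.
SKIP = {"script", "style", "nav", "header", "footer", "aside", "form"}
CONTAINERS = {"div", "article", "section", "main", "td", "body"}
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ParagraphScorer(HTMLParser):
    """Scores <p> elements and credits each score to the nearest container."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # open elements as (tag, unique id)
        self.counter = 0
        self.text = []        # text chunks of the <p> currently being read
        self.link_chars = 0   # how many of those characters sit inside <a>
        self.blocks = {}      # container id -> [score, [paragraph texts]]

    def _open(self, tag):
        return any(t == tag for t, _ in self.stack)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag == "p" and self._open("p"):
            self.handle_endtag("p")   # a new <p> implicitly closes an open one
        self.stack.append((tag, self.counter))
        self.counter += 1
        if tag == "p":
            self.text, self.link_chars = [], 0

    def handle_endtag(self, tag):
        if not self._open(tag):
            return                    # stray end tag, ignore it
        while self.stack:
            closed, _ = self.stack.pop()
            if closed == "p":
                self._finish_paragraph()
            if closed == tag:
                break

    def handle_data(self, data):
        tags = [t for t, _ in self.stack]
        if "p" not in tags or SKIP.intersection(tags):
            return                    # outside a paragraph, or inside page chrome
        self.text.append(data)
        if "a" in tags:
            self.link_chars += len(data)

    def _finish_paragraph(self):
        raw = "".join(self.text)
        text = " ".join(raw.split())
        link_chars = self.link_chars
        self.text, self.link_chars = [], 0
        if len(text) < 25:
            return                    # too short to be article prose
        density = link_chars / len(raw)
        if density > 0.5:
            return                    # mostly link text, likely a menu or teaser
        score = (1 + text.count(",") + min(len(text) // 100, 3)) * (1 - density)
        for tag, node_id in reversed(self.stack):
            if tag in CONTAINERS:     # credit the nearest enclosing container
                entry = self.blocks.setdefault(node_id, [0.0, []])
                entry[0] += score
                entry[1].append(text)
                break


def extract_article(html):
    """Return the paragraphs of the highest-scoring container, or ''."""
    parser = ParagraphScorer()
    parser.feed(html)
    parser.close()
    parser.handle_endtag("p")         # flush a trailing unclosed <p>, if any
    if not parser.blocks:
        return ""
    _, paragraphs = max(parser.blocks.values(), key=lambda b: b[0])
    return "\n\n".join(paragraphs)


SAMPLE = """
<html><body>
<nav><p>Home, World, Business, Technology, Sports, and more sections</p></nav>
<div id="sidebar"><p><a href="/a">Most read stories today, ranked by readers, updated hourly</a></p></div>
<div id="story">
<p>The city council voted on Tuesday to approve the new transit plan, ending months of debate.
<p>Supporters said the plan, which adds three bus lines, would cut commute times for thousands.</p>
</div>
<footer><p>Copyright, terms of use, privacy policy, and contact information</p></footer>
</body></html>
"""

article = extract_article(SAMPLE)
print(article)
```

On the made-up sample page, the navigation, sidebar link, and footer are dropped and only the two story paragraphs remain. Real pages are much messier, so treat this as a starting point for experimentation, not a finished extractor.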
