Asked: June 11, 2026

I searched Stack Overflow and Google extensively but couldn't find a good answer to this.
I'm developing a news reader system that crawls the web and collects news articles. I then want to find similar or related articles across websites, so that the site does not show duplicated news.

I think the best live example is Google News: it collects news from the web, categorizes it, and finds related news and articles. This is what I want to do.

What’s the best algorithm for doing this?

1 Answer

Editorial Team
Added an answer on June 11, 2026 at 6:23 pm
A relatively simple solution is to compute a tf-idf vector (en.wikipedia.org/wiki/Tf*idf) for each document, then use the cosine distance between these vectors (one minus their cosine similarity, en.wikipedia.org/wiki/Cosine_similarity) as an estimate of the semantic distance between articles. Two articles whose vectors are close together are likely to cover the same story.

This will probably capture semantic relationships better than Levenshtein distance, and it is much faster to compute.
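To make the idea concrete, here is a minimal sketch in plain Python, not a production implementation. The function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`), the simple regex tokenizer, and the smoothed idf formula are my own assumptions for illustration; a real system would likely use a library, stop-word removal, and stemming.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Assumption: smoothed idf, log((1 + N) / (1 + df)) + 1,
            # so that terms present in every document keep a small weight.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical headlines, made up for this example.
articles = [
    "Central bank raises interest rates to fight inflation",
    "Interest rates raised by central bank as inflation climbs",
    "Local team wins the football championship final",
]
vectors = tfidf_vectors(articles)
related = cosine_similarity(vectors[0], vectors[1])
unrelated = cosine_similarity(vectors[0], vectors[2])
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

The cosine distance is simply `1 - cosine_similarity(a, b)`. To suppress duplicates you could treat any pair whose similarity exceeds some threshold as the same story; the right threshold depends on your data and has to be tuned by hand.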