I searched a lot in stackoverflow and Google but I didn’t find the best answer for this.
Actually, I’m going to develop a news reader system that crawl and collect news from web (with a crawler) and then, I want to find similar or related news in websites (In order to prevent showing duplicated news in website)
I think the best live example for that is Google News, it collect news from web and then categorize and find related news and articles. This is what I want to do.
What’s the best algorithm for doing this?
A relatively simple solution is to compute a tf-idf vector (en.wikipedia.org/wiki/Tf*idf) for each document, then use the cosine distance (en.wikipedia.org/wiki/Cosine_similarity) between these vectors as an estimate for semantic distance between articles.
This will probably capture semantic relationships better than Levenstein distance and is much faster to compute.