When retrieving posts from an RSS feed and caching or saving them in a database, how can I determine that:
- a post is the same post I already stored (example: a typo was fixed in the feed, or the title or the date changed, etc.)
- posts from different feeds talk about the same topic (example: the same story from different sources)
Are there any best practices for these things?
Thanks a lot.
Some RSS feeds include a guid element as an identifier for each item. Posts that share a guid are probably the same post, so a changed title or date with an unchanged guid can be treated as an update rather than a new post. Some RSS feeds just put the URL in the guid to indicate that a post's uniqueness is tied to its URL. Note that if the URL matches but the guid does not, this may indicate that the posts are not duplicates: if a feed does not maintain an archive, the URL might stay the same from one post to the next. This situation is probably pretty rare.
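
As a rough illustration of the guid-first approach, here is a minimal Python sketch using only the standard library. It assumes a plain RSS 2.0 feed, uses a dict as a stand-in for the database, and falls back to the link when an item has no guid. The names post_key and upsert_items are made up for this example, and the fallback rule is one possible choice, not an established standard.

```python
import xml.etree.ElementTree as ET


def post_key(item):
    """Return an identity for an RSS <item>: the guid if present, else the link."""
    guid = (item.findtext("guid") or "").strip()
    if guid:
        return ("guid", guid)
    link = (item.findtext("link") or "").strip()
    if link:
        return ("link", link)
    return None  # nothing reliable to identify the post by


def upsert_items(rss_xml, store):
    """Add new posts to `store` and overwrite posts whose key is already known.

    `store` is a dict standing in for a database table (an assumption of this
    sketch). Returns a tuple (new, updated, skipped).
    """
    new = updated = skipped = 0
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        key = post_key(item)
        if key is None:
            skipped += 1
            continue
        record = {
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "pub_date": (item.findtext("pubDate") or "").strip(),
        }
        if key not in store:
            new += 1
        elif store[key] != record:
            # Same identity, different content: treat as an edit of the post.
            updated += 1
        store[key] = record
    return new, updated, skipped
```

With this, a post whose title or date changes but whose guid stays the same overwrites the stored copy instead of creating a second row. Keying on the guid first also keeps apart two posts that share a URL but carry different guids.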