What are some techniques for detecting whether one webpage is the same as another?
By “same” I don’t mean char-for-char equivalent (that’s easy), but something robust enough to ignore details like a current date/time on the page, etc.
E.g., load a Yahoo! News article, then open the same page 10 minutes later in another browser. Barring rewrites, those pages will have some differences (timestamps, possibly ads, possibly related stories), but a human could look at the two and say they’re the same.
Note I’m not trying to fix (or rely on) URL normalization, i.e., figuring out that foo.html and foo.html?bar=bang are the same.
It sounds like you are after a robust way to measure the similarity of two pages.
Given that the structure of the page won’t change that much, we can reduce the problem to testing whether the text on the pages is roughly the same. Of course, with this approach the problems nickf alluded to regarding a photographer’s page are still there, but if you are mainly concerned with Yahoo! News or the like, this should be okay.
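As a first step you’d strip the markup down to visible text before comparing. Here’s a minimal sketch using Python’s stdlib `html.parser` (the class and function names are my own; for messy real-world pages a dedicated extractor would be more robust):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def page_text(html):
    """Return the visible text of an HTML page as one whitespace-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

With the text extracted, the comparison below operates on plain strings rather than raw HTML, which already removes a lot of incidental markup churn between page loads.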
To compare two pages, you can use a method from machine learning called ‘string kernels’. Here’s an early paper, a recent set of slides on an R package, and a video lecture.
Very roughly, a string kernel counts how many words, pairs of words, triples of words, etc. two documents have in common. If A and B are two documents and k is a string kernel, then the higher the value of k(A, B), the more similar the two documents are.
If you set a threshold t and only declare two documents the same when k(A, B) > t, you should have a reasonably good way of doing what you want. Of course, you’ll have to tune the threshold to get the best results for your application.
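The idea above can be sketched in a few lines of Python. This is a simple word n-gram overlap kernel (a spectrum-style kernel, normalized so identical documents score 1.0); the names `string_kernel`, `similarity`, and `same_page` and the default threshold of 0.8 are illustrative choices, not part of any paper:

```python
import math
from collections import Counter

def ngrams(text, n):
    """Count the word n-grams of length n in a text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def string_kernel(a, b, max_n=3):
    """Count shared word n-grams (n = 1..max_n) between two documents."""
    score = 0
    for n in range(1, max_n + 1):
        # Counter intersection takes the minimum count of each shared n-gram
        score += sum((ngrams(a, n) & ngrams(b, n)).values())
    return score

def similarity(a, b):
    """Normalize k(A, B) so that similarity(x, x) == 1.0."""
    kab = string_kernel(a, b)
    return kab / math.sqrt(string_kernel(a, a) * string_kernel(b, b))

def same_page(a, b, t=0.8):
    """Declare two pages 'the same' when similarity exceeds the threshold t."""
    return similarity(a, b) > t
```

Because the score is normalized to [0, 1], the threshold t has a stable meaning across documents of different lengths; tuning it on a handful of known same/different page pairs should be enough to start.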