I am trying to figure out the best way to check whether two or more urls are duplicates when they carry extra parameters, as in the code below. In fact, url1 and url2 point to the same page, but when the web spider runs it treats them as two separate urls, so the result is duplicated.
from urllib2 import urlopen
import hashlib

# Fetch each url and hash what it returns (here, the response body)
url1 = urlopen('http://www.time.com/time/nation/article/0,8599,2109975,00.html?xid=gonewssedit')
u1 = hashlib.md5(url1.read()).hexdigest()
url2 = urlopen('http://www.time.com/time/nation/article/0,8599,2109975,00.html')
u2 = hashlib.md5(url2.read()).hexdigest()

# Equal digests mean both urls returned byte-identical content
if u1 == u2:
    print 'yes'
else:
    print 'no'
In short, I will generate an md5 hash using the url header, store it in the database, and then, when I crawl a new url, check whether it is a duplicate. But I am not sure this is the best way to do this in Python.
Thank you very much
The result of the web page may be the same or different depending on the ‘extra parameters’. So, in general, you cannot define rules that detect duplicate content only by looking at the url.
I would suggest treating url1 and url2 as different. Compute an md5sum of each block of, say, 1024 words received from the urls. Maintain a hash map of these md5sums to be able to detect duplicates.
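As a rough illustration of the block-hashing idea, here is a minimal sketch. It assumes the page text has already been fetched and decoded to a string; the names block_hashes and DuplicateIndex are hypothetical and only serve this example, and splitting on whitespace is just one possible way to define a "word".

```python
import hashlib


def block_hashes(text, block_size=1024):
    """Return one MD5 hex digest per consecutive block of `block_size` words."""
    words = text.split()
    digests = []
    for start in range(0, len(words), block_size):
        block = " ".join(words[start:start + block_size])
        digests.append(hashlib.md5(block.encode("utf-8")).hexdigest())
    return digests


class DuplicateIndex(object):
    """Hash map from block digest to the first url that produced it."""

    def __init__(self):
        self.seen = {}

    def add(self, url, text, block_size=1024):
        """Record the blocks of `text`; return (digest, earlier_url) pairs
        for every block that an earlier, different url already produced."""
        matches = []
        for digest in block_hashes(text, block_size):
            earlier = self.seen.get(digest)
            if earlier is not None and earlier != url:
                matches.append((digest, earlier))
            else:
                self.seen.setdefault(digest, url)
        return matches
```

A crawler could call index.add(url, page_text) for each fetched page and treat the page as a likely duplicate when most of its blocks come back as matches. Note that a single inserted word shifts every later block boundary, so this only catches content that is identical block for block.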
Some web crawling tools may already offer some of the features you need.
Update based on OP’s comments: I wrote some code to enhance my answer. There are two versions: the first one is simpler:
It was naive to expect the above code to detect duplicates in the expected way: current web pages are much too complex. Even two urls that look the same to a user actually differ in many ways due to ads, hash tags, self-url-name inclusion, etc.
The second version is more sophisticated. It introduces a limited semantic analysis of the content based on BeautifulSoup:
However, the second version does not work either, for the same reason. The difference between the head texts of the two urls was an included hash value, and the difference between the body texts of the two urls was the string webhp. I used difflib.context_diff to compute the difference.
It is possible to enhance the code with a third version that parses the web pages more intelligently and computes the diff more intelligently, for example by declaring texts as duplicates even when they differ by less than 5% (this ratio can be easily computed using a difflib function).
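The 5% idea could be sketched roughly as follows, using difflib.SequenceMatcher.ratio() as the similarity measure. This is only an assumption about what such a third version might look like: the function name is_near_duplicate, the line-by-line comparison, and the max_diff default are all illustrative choices, not the code from this answer.

```python
import difflib


def is_near_duplicate(text_a, text_b, max_diff=0.05):
    """Return True when the two texts differ by less than `max_diff`.

    The texts are compared line by line; ratio() returns a similarity in
    [0, 1], so 1 - ratio() is used here as the fraction that differs.
    """
    matcher = difflib.SequenceMatcher(None, text_a.splitlines(), text_b.splitlines())
    return (1.0 - matcher.ratio()) < max_diff
```

For example, two 100-line texts that differ in a single line give a ratio of 0.99 and would be declared duplicates, while two texts with no lines in common give a ratio of 0 and would not. Comparing lines instead of characters keeps the comparison fast on large pages, at the cost of counting a one-character change as a whole changed line.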