I am figuring the best way how to check if two or more url

Question

Asked: May 31, 2026 (2026-05-31T18:41:49+00:00)

I am trying to figure out the best way to check whether two or more urls are duplicates when they differ only by some extra parameters, as in the code below. In fact, url1 and url2 are the same page, but when the web spider runs, it treats them as two separate urls, so the results are duplicated.

from urllib2 import urlopen
import hashlib

url1 = urlopen('http://www.time.com/time/nation/article/0,8599,2109975,00.html?xid=gonewssedit')
u1 = hashlib.md5(url1.read()).hexdigest()  # hash the response body
url2 = urlopen('http://www.time.com/time/nation/article/0,8599,2109975,00.html')
u2 = hashlib.md5(url2.read()).hexdigest()  # hash the response body
if u1 == u2:
    print 'yes'
else:
    print 'no'

In short, I will generate an md5 hash using the url header, store it in the database, and then, when I crawl a new url, check whether it is a duplicate. But I am not sure this is the best way to do this in Python.

Thank you very much

1 Answer

Editorial Team · Answer 1 · 2026-05-31T18:41:50+00:00

The result of the web page may be the same or different depending on the ‘extra parameters’. So, in general, you cannot define rules that detect duplicate content only by looking at the url.
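To make the limitation concrete, here is a minimal Python 3 sketch of a URL-only rule. It strips query parameters that you have decided, per site, do not change the content. The names canonical_url and IGNORED_PARAMS are hypothetical, and the assumption that xid is irrelevant for this site is exactly the kind of knowledge a general rule cannot have.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical, site-specific list of parameters assumed not to affect content.
IGNORED_PARAMS = {"xid"}

def canonical_url(url, ignored=IGNORED_PARAMS):
    """Return url with the ignored query parameters and the fragment removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in ignored]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

url1 = 'http://www.time.com/time/nation/article/0,8599,2109975,00.html?xid=gonewssedit'
url2 = 'http://www.time.com/time/nation/article/0,8599,2109975,00.html'
same = canonical_url(url1) == canonical_url(url2)
```

This only helps when you already know which parameters are safe to drop; a parameter such as a page number would have to be kept, which is why the content itself still needs to be compared.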

I would suggest treating url1 and url2 as different. Compute an md5sum of each block of, say, 1024 words received from the urls, and maintain a hash map of these md5sums to detect duplicates.

Some web crawling tools may offer some of the features you need.

Update based on OP’s comments: I wrote some code to enhance my answer. There are two versions, and the first one is simpler:

def find_matches():
    """
        Basic version: reads urls, but does not consider the semantic information of
        HTML header, body, etc. while computing duplicates.
    """

    from urllib2 import urlopen
    import hashlib

    urls = [ 'http://www.google.com', 'http://www.google.com/search']

    d = {}
    url_contents = {}
    matches = []
    for url in urls:
        c = urlopen(url)
        url_contents[url] = []
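        while True:
            r = c.read(4096)  # read the response in 4096-byte blocks
            if not r:
                break
            md5 = hashlib.md5(r).hexdigest()
            url_contents[url].append(md5)
            if md5 in d:
                # copy the list so the later append does not alter this match
                url2 = list(d[md5])
                matches.append((md5, url, url2))
            else:
                d[md5] = []
            d[md5].append(url)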
    #print url_contents
    print matches

if __name__ == '__main__':
    find_matches()

It was naive to expect the above code to detect duplicates as intended: modern web pages are far too complex. Even two urls that look the same to a user actually differ in many ways because of ads, hash tags, self-url-name inclusion, etc.
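The effect can be reproduced offline. The following Python 3 sketch uses two made-up pages (not real responses) that carry the same article text but a different ad token; block_digests is a hypothetical helper that mirrors the 4096-byte block hashing above.

```python
import hashlib

def block_digests(data, block_size=4096):
    """Return the MD5 hex digest of each consecutive block of bytes."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

# Two made-up pages: identical article text, different ad token.
article = b"<p>Same article text.</p>" * 400
page_a = b"<html><body><div id='ad'>token-1</div>" + article + b"</body></html>"
page_b = b"<html><body><div id='ad'>token-22222</div>" + article + b"</body></html>"

digests_a = block_digests(page_a)
digests_b = block_digests(page_b)

# Digests that appear in both pages; empty here, although the article is identical.
shared = set(digests_a) & set(digests_b)
```

Because the two tokens differ in length, every later block boundary is shifted, so no block digest matches even though almost all of the content is the same.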

The second version is more sophisticated. It introduces a limited semantic analysis of the content based on BeautifulSoup:

def find_matches():
    """
        Some consideration of the HTML header, body, etc. while computing duplicates.
    """

    from urllib2 import urlopen
    import hashlib
    from BeautifulSoup import BeautifulSoup
    import pprint

    urls = [ 'http://www.google.com', 'http://www.google.com/search'] # assuming all distinct urls

    def txt_md5(txt):
        # encode first so non-ASCII page text does not raise UnicodeEncodeError
        return hashlib.md5(txt.encode('utf-8')).hexdigest()

    MAX_FILE_SIZE = 1024*1024*1024 
    d = {}
    url_contents = {}
    matches = []
    for url in urls:
        try:
            c = urlopen(url)
            url_contents[url] = []
            r = c.read(MAX_FILE_SIZE)
            soup = BeautifulSoup(r)
            header = soup.find('head').text
            body = soup.find('body').text 
            # More fine-grained content options 
            # like h1, h2, p, etc., can be included.
            # Common CSS tags like page, content, etc.
            # can also be included.
            for h in [header, body]:
                print h
                md5 = txt_md5(h)
                url_contents[url].append((md5, h))
                if md5 in d:
                    # copy the list so the later append does not alter this match
                    url2 = list(d[md5])
                    matches.append((md5, url, url2))
                else:
                    d[md5] = []
                d[md5].append(url)
        except Exception as e:
            print "Exception", e
    print '---------------'
    #pprint.pprint(url_contents)
    print matches

if __name__ == '__main__':
    find_matches()

However, the second version does not work either, for the same reason. The difference between the head texts of the two urls was an embedded hash value, and the difference between the body texts was the string webhp. I used difflib.context_diff to compute the differences.
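For reference, a comparison of this kind might look like the Python 3 sketch below. The two line lists are made-up stand-ins for the head text of the two pages, not the actual responses, and the labels url1 and url2 are only used in the diff header.

```python
import difflib

# Made-up stand-ins for the head text of two near-identical pages.
head_a = ["Google", "token=abc123", "Search the web"]
head_b = ["Google", "token=xyz789", "Search the web"]

# context_diff yields the diff line by line; changed lines start with "! ".
diff = list(difflib.context_diff(head_a, head_b,
                                 fromfile="url1", tofile="url2", lineterm=""))
print("\n".join(diff))
```

Only the token line is marked as changed, which is the kind of small difference that makes an exact hash comparison fail.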

It is possible to extend the code with a third version that parses the web pages and computes the diff more intelligently, for example by declaring texts with <5% diff to be duplicates (this ratio can be easily computed using a difflib function).
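A minimal Python 3 sketch of that idea, assuming difflib.SequenceMatcher.ratio() is the function meant here. The name is_near_duplicate and the sample texts are hypothetical, and the 5% threshold is taken from the paragraph above rather than tuned on real pages.

```python
import difflib

def is_near_duplicate(text_a, text_b, max_diff=0.05):
    """Return True when the two texts differ by less than max_diff (0.05 = 5%)."""
    # autojunk=False disables the heuristic that ignores very frequent characters.
    matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
    return (1.0 - matcher.ratio()) < max_diff

# Made-up page texts that differ only in a short token.
base = "The quick brown fox jumps over the lazy dog. " * 5
text_a = base + "token=abc"
text_b = base + "token=xyz"

near = is_near_duplicate(text_a, text_b)
far = is_near_duplicate("completely different", "nothing alike here 12345")
```

SequenceMatcher can be slow on large inputs, so for whole pages it may be more practical to compare the extracted head and body text separately, as the second version does.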
