
The Archive Base


Editorial Team
Asked: May 31, 2026

How to check if two or more urls are duplicates

I am trying to figure out the best way to check whether two or more urls are duplicates when they differ only by some extra parameters, as in the code below. In fact, url1 and url2 are the same page, but the web spider treats them as two separate urls, so the result is duplicated.

from urllib2 import urlopen
import hashlib

url1 = urlopen('http://www.time.com/time/nation/article/0,8599,2109975,00.html?xid=gonewssedit')
u1 = hashlib.md5(url1.read()).hexdigest()  # hash the response body
url2 = urlopen('http://www.time.com/time/nation/article/0,8599,2109975,00.html')
u2 = hashlib.md5(url2.read()).hexdigest()  # hash the response body
if u1 == u2:
    print 'yes'
else:
    print 'no'

In short, I will generate an md5 hash from the url header, store it in the database, and then check each newly crawled url against the stored hashes to see whether it is a duplicate. But I am not sure this is the best way to do this in Python.

Thank you very much


1 Answer

  1. Editorial Team
    Added an answer on May 31, 2026 at 6:41 pm

    The content of the web page may be the same or different depending on the ‘extra parameters’. So, in general, you cannot define rules that detect duplicate content only by looking at the url.

    I would suggest treating url1 and url2 as different. Compute an md5sum of each block of, say, 1024 words received from the urls. Maintain a hash map of these md5sums to be able to detect duplicates.
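
    The block-hashing idea can be sketched as follows. This is only an illustration in Python 3, assuming the page texts have already been downloaded into a dict; the names block_hashes, find_duplicate_blocks and pages are hypothetical and not from any library.

```python
import hashlib


def block_hashes(text, block_size=1024):
    """Return the MD5 hex digest of each block of `block_size` words in `text`."""
    words = text.split()
    hashes = []
    for start in range(0, len(words), block_size):
        block = " ".join(words[start:start + block_size])
        hashes.append(hashlib.md5(block.encode("utf-8")).hexdigest())
    return hashes


def find_duplicate_blocks(pages, block_size=1024):
    """Map each block hash found in more than one page to the urls containing it.

    `pages` is assumed to be a dict of url -> already downloaded page text.
    """
    seen = {}  # block hash -> set of urls in which that block appears
    for url, text in pages.items():
        for digest in block_hashes(text, block_size):
            seen.setdefault(digest, set()).add(url)
    return {digest: urls for digest, urls in seen.items() if len(urls) > 1}
```

    Only blocks that are word-for-word identical produce the same hash, so a single changed word hides the match for its whole block.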

    Some web crawling tools may already offer some of the features you need.


    Update based on OP’s comments: I wrote some code to enhance my answer. There are two versions. The first one is simpler:

    def find_matches():
        """
            Basic version: reads urls, but does not consider the semantic information of
            HTML header, body, etc. while computing duplicates.
        """
    
        from urllib2 import urlopen
        import hashlib
    
        urls = ['http://www.google.com', 'http://www.google.com/search']
    
        d = {}             # md5 of a block -> urls in which that block was seen
        url_contents = {}  # url -> md5 of each block, in reading order
        matches = []       # (md5, url, urls that already contained the block)
        for url in urls:
            c = urlopen(url)
            url_contents[url] = []
            while True:
                r = c.read(4096)  # read the response in blocks of up to 4096 bytes
                if not r:
                    break
                md5 = hashlib.md5(r).hexdigest()
                url_contents[url].append(md5)
                if md5 in d:
                    # copy the list, so that the append below does not alter the recorded match
                    url2 = list(d[md5])
                    matches.append((md5, url, url2))
                else:
                    d[md5] = []
                d[md5].append(url)
        #print url_contents
        print matches
    
    if __name__ == '__main__':
        find_matches()
    

    It was naive to expect the above code to detect duplicates as intended: current web pages are much too complex. Even two urls that look the same to a user actually differ in many ways, due to ads, hash tags, inclusion of the page's own url, etc.

    The second version is more sophisticated. It introduces a limited semantic analysis of the content based on BeautifulSoup:

    def find_matches():
        """
            Some consideration of the HTML header, body, etc. while computing duplicates.
        """
    
        from urllib2 import urlopen
        import hashlib
        from BeautifulSoup import BeautifulSoup
        import pprint
    
        urls = ['http://www.google.com', 'http://www.google.com/search'] # assuming all distinct urls
    
        def txt_md5(txt):
            # the extracted text is unicode; encode it so that md5 also accepts non-ASCII text
            return hashlib.md5(txt.encode('utf-8')).hexdigest()
    
        MAX_FILE_SIZE = 1024*1024*1024  # read at most 1 GB per url
        d = {}             # md5 of a section -> urls in which that section was seen
        url_contents = {}  # url -> (md5, text) of its head and body
        matches = []       # (md5, url, urls that already contained the section)
        for url in urls:
            try:
                c = urlopen(url)
                url_contents[url] = []
                r = c.read(MAX_FILE_SIZE)
                soup = BeautifulSoup(r)
                # a page without <head> or <body> raises AttributeError here,
                # which is reported by the except clause below
                header = soup.find('head').text
                body = soup.find('body').text
                # More fine-grained content options
                # like h1, h2, p, etc., can be included.
                # Common CSS tags like page, content, etc.
                # can also be included.
                for h in [header, body]:
                    print h
                    md5 = txt_md5(h)
                    url_contents[url].append((md5, h))
                    if md5 in d:
                        # copy the list, so that the append below does not alter the recorded match
                        url2 = list(d[md5])
                        matches.append((md5, url, url2))
                    else:
                        d[md5] = []
                    d[md5].append(url)
            except Exception as e:
                print "Exception", e
        print '---------------'
        #pprint.pprint(url_contents)
        print matches
    
    if __name__ == '__main__':
        find_matches()
    

    However, the second version does not work either, for the same reason. The head texts of the two urls differed by an included hash value, and the body texts differed by the string webhp. I used difflib.context_diff to compute the difference.
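
    For reference, a comparison of this kind might look like the sketch below. It is written for Python 3 and is not the exact code used above; the helper name show_differences and the labels page_a and page_b are hypothetical.

```python
import difflib


def show_differences(text_a, text_b, name_a="page_a", name_b="page_b"):
    """Return the context diff between two texts as a list of lines."""
    return list(difflib.context_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=name_a,
        tofile=name_b,
        lineterm="",  # the input lines carry no trailing newlines
    ))


if __name__ == "__main__":
    for line in show_differences("title\nhash=abc\nfooter", "title\nhash=xyz\nfooter"):
        print(line)
```

    Lines that changed between the two texts are prefixed with "! ", and identical inputs produce an empty diff.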

    It is possible to enhance the code with a third version that parses the web pages and computes the diff more intelligently, for example by declaring as duplicates even texts with <5% diff (this ratio can be easily computed using a difflib function).
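
    A minimal sketch of such a threshold check, in Python 3, is given below. It assumes that difflib.SequenceMatcher.ratio() is an acceptable measure of similarity and compares the texts word by word; the function name is_near_duplicate and the max_diff parameter are hypothetical.

```python
import difflib


def is_near_duplicate(text_a, text_b, max_diff=0.05):
    """Return True if the two texts differ by less than `max_diff` (5% by default)."""
    # Compare word lists rather than characters to keep the comparison cheaper.
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    similarity = matcher.ratio()  # 1.0 for identical sequences, 0.0 for nothing in common
    return (1.0 - similarity) < max_diff
```

    This compares pages pairwise, so it would probably be applied only to candidate pairs (for example, urls that share the same path), not to every pair of crawled pages.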

© 2021 The Archive Base. All Rights Reserved