I am new to python, and i am developing a web crawler below is

Question

Asked: June 1, 2026

I am new to Python and am developing a web crawler. Below is the program, which gets the links from a given URL. The problem is that I don't want it to visit a URL that it has already visited. Please help me.

import re
import urllib.request
import sqlite3
db = sqlite3.connect('test2.db')
db.row_factory = sqlite3.Row
db.execute('drop table if exists test')
db.execute('create table test(id INTEGER PRIMARY KEY,url text)')
#linksList = []
#function to visit the given url and get all the links in that page
def get_links(urlparse):
        try:
            if urlparse.find('.msi') ==-1: #check whether the url contains .msi extensions
                htmlSource = urllib.request.urlopen(urlparse).read().decode("iso-8859-1")  
                #parsing htmlSource and finding all anchor tags
                linksList = re.findall('<a href=(.*?)>.*?</a>',htmlSource) #returns href and other attributes of a tag
                for link in linksList: 
                    start_quote = link.find('"') # setting start point in the link 
                    end_quote = link.find('"', start_quote + 1) #setting end point in the link
                    url = link[start_quote + 1:end_quote] # get the string between start_quote and end_quote
                    def concate(url): #since few href may return only /contact or /about so concatenating its baseurl
                        if url.find('http://'):
                            url = (urlparse) + url
                            return url
                        else:
                            return url
                    url_after_concate = concate(url)
#                    linksList.append(url_after_concate)
                    try:
                        if url_after_concate.find('.tar.bz') == -1: # skipping links that point to software or download pages
                            db.execute('insert or ignore into test(url) values (?)', [url_after_concate])
                    except:
                        print("insertion failed")
            else:
                return True
        except:
            print("failed")
get_links('http://www.python.org')     
cursor = db.execute('select * from test')
for row in cursor: # retrieve the links stored in database
    print (row['id'],row['url'])
    urlparse = row['url']
#    print(linksList)
#    if urlparse in linksList == -1:
    try:
        get_links(urlparse) # again parse the link from database
    except:
        print ("url error")

Please suggest how I can solve this problem.

1 Answer

Editorial Team · Answer 1 · 2026-06-01T23:00:59+00:00

You should keep a list of 'visited' pages. Before requesting the next URL, check whether the list already contains it, and if so, skip it. I'm not a Python programmer, so here's some pseudo-code:

Create listOfVisitedUrls
...
Start Loop
Get nextUrl
If nextUrl IsNotIn listOfVisitedUrls Then
    Request nextUrl
    Add nextUrl to listOfVisitedUrls
End If
Loop
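For reference, the pseudo-code above could look roughly like this in Python. This is only a sketch under some assumptions: `crawl`, `get_links`, `max_pages`, `FAKE_SITE` and `fake_get_links` are names made up for illustration, and the page fetching is replaced by a small in-memory link graph so the example runs without network access. In the real crawler, `get_links` would be whatever function downloads a page and returns the URLs found on it.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Visit each reachable URL at most once, in breadth-first order.

    get_links is a hypothetical callable: it takes a URL and returns
    an iterable of the URLs found on that page.
    """
    visited = set()             # URLs that have already been requested
    queue = deque([start_url])  # URLs waiting to be requested
    order = []                  # visit order, kept only for inspection

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue            # already requested, so skip it
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order


# Stand-in for real fetching: a tiny in-memory link graph with cycles.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": ["http://example.com/a"],
}


def fake_get_links(url):
    """Return the links for a URL from the fake graph (empty if unknown)."""
    return FAKE_SITE.get(url, [])


if __name__ == "__main__":
    for page in crawl("http://example.com/", fake_get_links):
        print(page)
```

A `set` is used instead of a list because membership checks on a set do not slow down as the number of visited URLs grows. If you would rather keep the URLs in SQLite as in the question, note that `insert or ignore` only skips a row when it violates a constraint, and the `url` column in the question's table has no `UNIQUE` constraint, so duplicates are still inserted. Declaring the column as `url text UNIQUE` would be one way to get the same effect in the database.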
