I am trying to implement a simple web crawler and I have already written

Asked: June 17, 2026 (2026-06-17T20:50:01+00:00)

I am trying to implement a simple web crawler and have already written some simple code to start off. There are two modules, fetcher.py and crawler.py. Here are the files:

fetcher.py :

    import urllib2
    import re

    def fetcher(s):
        "fetch a web page from a url"

        try:
            req = urllib2.Request(s)
            urlResponse = urllib2.urlopen(req).read()
        except urllib2.URLError as e:
            print e.reason
            return

        p,q = s.split("//")
        d = q.split("/")
        fdes = open(d[0],"w+")
        fdes.write(str(urlResponse))
        fdes.seek(0)
        return fdes


    if __name__ == "__main__":
        defaultSeed = "http://www.python.org"
        print fetcher(defaultSeed)

crawler.py :

    from bs4 import BeautifulSoup
    import re
    from fetchpage import fetcher

    usedLinks = open("Used","a+")
    newLinks = open("New","w+")

    newLinks.seek(0)

    def parse(fd,var=0):
        soup = BeautifulSoup(fd)
        for li in soup.find_all("a",href=re.compile("http")):
            newLinks.seek(0,2)
            newLinks.write(str(li.get("href")).strip("/"))
            newLinks.write("\n")

        fd.close()
        newLinks.seek(var)
        link = newLinks.readline().strip("\n")

        return str(link)


    def crawler(seed,n):
        if n == 0:
            usedLinks.close()
            newLinks.close()
            return
        else:
            usedLinks.write(seed)
            usedLinks.write("\n")
            fdes = fetcher(seed)
            newSeed = parse(fdes,newLinks.tell())
            crawler(newSeed,n-1)

    if __name__ == "__main__":
        crawler("http://www.python.org/",7)

The problem is that when I run crawler.py, it works fine for the first 4-5 links, then hangs, and after a minute gives me the following error:

    [Errno 110] Connection timed out
    Traceback (most recent call last):
      File "crawler.py", line 37, in <module>
        crawler("http://www.python.org/",7)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 33, in crawler
        newSeed = parse(fdes,newLinks.tell())
      File "crawler.py", line 11, in parse
        soup = BeautifulSoup(fd)
      File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
        self.builder.prepare_markup(markup, from_encoding))
      File "/usr/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
        dammit = UnicodeDammit(markup, try_encodings, is_html=True)
      File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 191, in __init__
        self._detectEncoding(markup, is_html)
      File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 362, in _detectEncoding
        xml_encoding_match = xml_encoding_re.match(xml_data)
    TypeError: expected string or buffer

Can anyone help me with this? I am very new to Python and cannot work out why it says the connection timed out after some time.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T20:50:02+00:00

A connection timeout is not specific to Python. It just means that you made a request to the server and the server did not respond within the amount of time that your application was willing to wait.
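
In the code from the question, the timeout surfaces as a `urllib2.URLError`: `fetcher` prints the reason and returns `None`, and that `None` is then handed to `BeautifulSoup`, which is what raises the `TypeError` at the bottom of the traceback. Below is a minimal sketch of one way to make the wait explicit and the failure visible to the caller, assuming Python 3 (where `urllib.request` replaces `urllib2`). The name `fetch_page`, the 10-second default, and the injectable `opener` parameter are illustrative choices, not part of the original code:

```python
import urllib.error
import urllib.request


def fetch_page(url, timeout=10, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails."""
    try:
        # timeout is the number of seconds to wait before giving up
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses in Python 3
        print("could not fetch %s: %s" % (url, exc))
        return None
```

A caller would then check the result before parsing (for example, only build the `BeautifulSoup` object when the returned body is not `None`), so a timed-out request is skipped rather than passed on to the parser.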

One very possible reason for this is that python.org has some mechanism to detect when it is getting multiple requests from a script, and probably just stops serving pages completely after 4-5 requests. If that is the case, there is little you can do to avoid it other than trying out your script on a different site.
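
If throttling is indeed the cause (that is a guess, not something the traceback confirms), pausing between requests and skipping failed fetches may keep the crawl going. Here is a rough sketch under that assumption, in Python 3. `crawl`, `fetch_page`, and the one-second `delay` are hypothetical names and values chosen for illustration, and `fetch_page` stands for any callable that returns the page body, or `None` on failure:

```python
import time


def crawl(urls, fetch_page, delay=1.0, sleep=time.sleep):
    """Fetch each URL in order, skipping failures and pausing after each request."""
    pages = {}
    for url in urls:
        body = fetch_page(url)
        if body is not None:
            pages[url] = body  # only successful fetches are kept for parsing
        sleep(delay)  # wait before the next request to avoid hammering the server
    return pages
```

Whether a delay is enough depends entirely on how the site decides to stop responding, so the value would need to be tuned by experiment.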
