I’m writing a crawler to download static HTML pages using urllib

Question

Asked: June 12, 2026

I’m writing a crawler to download static HTML pages using urllib.

The get_page function works for one cycle, but when I try to loop over it, it doesn’t open the content of the next URL I’ve fed in.

How do I make urllib.urlopen download HTML pages continuously? If that is not possible, is there another way to download web pages from within my Python code?

My code below only returns the HTML for the first website in the seed list:

import urllib
def get_page(url):
    return urllib.urlopen(url).read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)

The same crawl “once-only” problem also occurs with urllib2:

import urllib2
def get_page(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)

Without any exception handling, I’m getting an IOError with urllib:

Traceback (most recent call last):
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 91, in <module>
    print get_page(j)
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 4, in get_page
    return urllib.urlopen(url).read().decode('utf8')
  File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 462, in open_file
    return self.open_local_file(url)
  File "/usr/lib/python2.7/urllib.py", line 476, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html'

Without any exception handling, I’m getting a ValueError with urllib2:

Traceback (most recent call last):
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 95, in <module>
    print get_page(j)
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 7, in get_page
    response = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 392, in open
    protocol = req.get_type()
  File "/usr/lib/python2.7/urllib2.py", line 254, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http://www.pmo.gov.sg/content/pmosite/aboutpmo.html

ANSWERED:

The IOError and ValueError occurred because the second URL contained an invisible Unicode character, some sort of byte order mark (BOM): a non-breaking space was found in that string. Thanks for all your help and suggestions in solving the problem!
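
For anyone who runs into the same thing, here is a minimal sketch of cleaning each seed URL before handing it to urlopen. It is only an illustration: `clean_url` and the `INVISIBLE` set are names made up for this example, the set lists just a few characters that commonly sneak in through copy and paste, and the code is written for Python 3, where strings are Unicode text (in Python 2.7 the URL would first have to be decoded to a unicode object).

```python
# Hypothetical helper, not part of urllib: the set below is an assumption
# covering only a few invisible characters often introduced by copy/paste.
INVISIBLE = {
    u"\ufeff",  # byte order mark / zero width no-break space
    u"\u00a0",  # no-break space
    u"\u200b",  # zero width space
}

def clean_url(url):
    """Return url without the invisible characters above or outer whitespace."""
    return u"".join(ch for ch in url if ch not in INVISIBLE).strip()

seed = [
    u"http://www.pmo.gov.sg/content/pmosite/home.html",
    # Simulates the broken entry: a BOM was pasted in front of the URL.
    u"\ufeffhttp://www.pmo.gov.sg/content/pmosite/aboutpmo.html",
]

cleaned = [clean_url(u) for u in seed]
```

After this step every entry in `cleaned` starts with a plain `http://`, so it can be passed to the existing get_page function unchanged.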

1 Answer

Editorial Team · Answer 1 · 2026-06-12T01:01:05+00:00


Both of your examples work fine for me. The only explanation I can think of for your exact errors is that the second URL string contains some sort of non-printable character (a Unicode BOM, perhaps) that got filtered out when pasting the code here. Try copying the code back from this site into your file, or retyping the entire second string from scratch.
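
If retyping is not convenient, one way to check for such a character is to list every code point in the string that falls outside printable ASCII. The following is a rough sketch rather than a definitive tool: `find_suspicious` is a hypothetical helper name, it assumes Python 3 (or a unicode string in Python 2.7), and it treats anything other than visible ASCII as suspicious, which is an assumption that fits plain URLs like the ones in the question.

```python
import unicodedata

def find_suspicious(text):
    """List (index, code point, Unicode name) for characters outside visible ASCII."""
    hits = []
    for index, ch in enumerate(text):
        # Visible ASCII runs from '!' (0x21) to '~' (0x7E); spaces, control
        # characters and all non-ASCII characters are reported.
        if not (0x20 < ord(ch) < 0x7F):
            name = unicodedata.name(ch, "UNNAMED")
            hits.append((index, "U+%04X" % ord(ch), name))
    return hits

# Simulated bad entry: a no-break space hidden at the end of the URL.
url = u"http://www.pmo.gov.sg/content/pmosite/aboutpmo.html\u00a0"
for hit in find_suspicious(url):
    print(hit)
```

An empty result means the string contains only visible ASCII, so the problem would have to lie elsewhere.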
