I am trying to search the source code of a webpage, and download various

Question

Asked: June 8, 2026 (2026-06-08T18:53:53+00:00)

I am trying to search the source code of a webpage, and download various files from it using Python. This script searches the source code for .jpg files and downloads them all as expected. However, upon modifying the script (changing “.jpg” to “.png”, as shown below), I get the error:

Traceback (most recent call last):
  File "img.py", line 19, in <module>
    urllib.urlretrieve(images[z], "image"+str(z)+".png")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 91, in urlretrieve
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 237, in retrieve
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 205, in open
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 461, in open_file
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 475, in open_local_file
IOError: [Errno 2] No such file or directory: '/images/adapt-icon-search.png?1342791397'

Here is the script I am using:

import urllib
import urllib2
import re

print "enter url of site (such as 'dribbble.com')"

url = raw_input()
fullurl = "http://"+url

src = urllib2.urlopen(fullurl)
src = src.read()

images = re.findall('src="(.*\.png[^"]*)', src)

z=0
while z < len(images):
    urllib.urlretrieve(images[z], "image"+str(z)+".png")
    print "done"
    z+=1

Insight as to why this script doesn’t work for .png files would be much appreciated. Many thanks in advance.

UPDATE: below is a sample of the source I am wanting to search through:

<span rel="tipsy" title="This shot has rebounds." class="rebound-mark has-rebounds">1</span>
                </a>            
        </div>
    </div>
    <h2>
        <a href="/Dash" class="url" rel="contact" title="Dash"><img alt="Avatar-new" class="photo fn" src="http://dribbble.s3.amazonaws.com/users/107759/avatars/original/avatar-new.png?1339961321" /> Dash</a>
        <a href="/account/pro" class="badge-link">
    <span class="badge badge-pro">Pro</span>
</a>
    </h2>

1 Answer

Editorial Team · Answer 1 · 2026-06-08T18:53:55+00:00

So the error you are getting is this:

IOError: [Errno 2] No such file or directory: '/images/adapt-icon-search.png?1342791397'

What’s happening is that the web page you’re scraping has some PNG references that do not include the domain name. When you try to fetch one of them in your while loop, urllib.urlretrieve sees no scheme or host, treats the string as a local file path (note open_local_file in the traceback), and fails, because all you’ve given it is the location on the remote host: /images/adapt-icon-search.png?1342791397.

You need to extend your code to detect those kinds of URLs (which are perfectly legal and, in fact, very common). For the kind you’re hitting here, you just need to prepend the scheme and host name of the server (e.g. http://dribbble.com) to the matched URL.
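
As a rough illustration of that first case, here is a minimal sketch. It is written for Python 3 (the script in the question is Python 2), and the function name absolutize_root_relative is made up for this example. It only handles URLs that start with a single /, and leaves everything else untouched.

```python
from urllib.parse import urlsplit


def absolutize_root_relative(page_url, src):
    """Prepend the page's scheme and host to a URL that starts with one '/'.

    Hypothetical helper for illustration; anything that does not start
    with a single '/' is returned unchanged.
    """
    if src.startswith("/") and not src.startswith("//"):
        parts = urlsplit(page_url)
        return parts.scheme + "://" + parts.netloc + src
    return src


# The path from the traceback, resolved against the site being scraped.
fixed = absolutize_root_relative(
    "http://dribbble.com", "/images/adapt-icon-search.png?1342791397"
)
print(fixed)
```

URLs that already carry their own host, such as the avatar image in the sample source, pass through as they are.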

You will probably also want to handle relative URLs, which also omit the hostname but do not start with a / character. Those need to be prefixed with the directory part of the current page’s URL, if there is one. So if you were scraping http://dribbble.com/foo/bar.html, you’d need to prepend http://dribbble.com/foo/ to a relative URL.
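
The relative case could be sketched along the same lines. Again this is Python 3, the name absolutize_page_relative is hypothetical, and the sketch deliberately ignores things like ../ segments, so treat it as an illustration of the idea rather than a complete resolver.

```python
from urllib.parse import urlsplit


def absolutize_page_relative(page_url, src):
    """Prefix a relative URL with the directory of the page it came from.

    Hypothetical helper for illustration; URLs that have a scheme or
    start with '/' are returned unchanged, and '../' is not resolved.
    """
    if urlsplit(src).scheme or src.startswith("/"):
        return src
    parts = urlsplit(page_url)
    # "/foo/bar.html" -> "/foo/"; an empty path becomes "/".
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return parts.scheme + "://" + parts.netloc + directory + src


print(absolutize_page_relative("http://dribbble.com/foo/bar.html", "icons/a.png"))
```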

There’s likely a library that will automate handling of non-absolute URLs for you, perhaps as part of the web-scraping process. I’m afraid I don’t know much about web scraping first hand, but perhaps somebody else can suggest one in a comment.
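
One candidate from the standard library is urljoin, which resolves a possibly non-absolute URL against the URL of the page it was found on. It lives in urllib.parse in Python 3 and in the urlparse module in Python 2. Below is a minimal Python 3 sketch. The function name absolute_image_urls is made up for this example, and the regex is a slightly tightened variant of the one in the question (it stops at the closing quote), which is an assumption on my part rather than something the question requires.

```python
import re
from urllib.parse import urljoin


def absolute_image_urls(page_url, html, extension=".png"):
    """Return absolute URLs for every src="..." containing `extension`.

    Hypothetical helper for illustration: the regex only looks at
    double-quoted src attributes and is not a real HTML parser.
    """
    pattern = r'src="([^"]*' + re.escape(extension) + r'[^"]*)"'
    # urljoin leaves absolute URLs alone and resolves the others
    # against page_url.
    return [urljoin(page_url, src) for src in re.findall(pattern, html)]


sample = (
    '<img src="/images/adapt-icon-search.png?1342791397" />'
    '<img src="http://dribbble.s3.amazonaws.com/users/107759/avatars/'
    'original/avatar-new.png?1339961321" />'
)
for image_url in absolute_image_urls("http://dribbble.com", sample):
    print(image_url)
```

Each resulting URL can then be passed to the download call in the loop (urllib.urlretrieve in the question's Python 2 script, urllib.request.urlretrieve in Python 3) instead of the raw regex match.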
