I am trying to search the source code of a webpage, and download various files from it using Python. This script searches the source code for .jpg files and downloads them all as expected. However, upon modifying the script (changing “.jpg” to “.png”, as shown below), I get the error:
Traceback (most recent call last):
  File "img.py", line 19, in <module>
    urllib.urlretrieve(images[z], "image"+str(z)+".png")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 91, in urlretrieve
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 237, in retrieve
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 205, in open
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 461, in open_file
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 475, in open_local_file
IOError: [Errno 2] No such file or directory: '/images/adapt-icon-search.png?1342791397'
Here is the script I am using:
import urllib
import urllib2
import re
print "enter url of site (such as 'dribbble.com')"
url = raw_input()
fullurl = "http://"+url
src = urllib2.urlopen(fullurl)
src = src.read()
images = re.findall('src="(.*\.png[^"]*)', src)
z=0
while z < len(images):
    urllib.urlretrieve(images[z], "image"+str(z)+".png")
    print "done"
    z+=1
Insight as to why this script doesn’t work for .png files would be much appreciated. Many thanks in advance.
UPDATE: below is a sample of the source I am wanting to search through:
<span rel="tipsy" title="This shot has rebounds." class="rebound-mark has-rebounds">1</span>
</a>
</div>
</div>
<h2>
<a href="/Dash" class="url" rel="contact" title="Dash"><img alt="Avatar-new" class="photo fn" src="http://dribbble.s3.amazonaws.com/users/107759/avatars/original/avatar-new.png?1339961321" /> Dash</a>
<a href="/account/pro" class="badge-link">
<span class="badge badge-pro">Pro</span>
</a>
</h2>
So the error you are getting is this:
IOError: [Errno 2] No such file or directory: '/images/adapt-icon-search.png?1342791397'
What’s happening is that the web page you’re scraping has some PNG references that do not include the domain name in the URL. When you try to fetch them in your while loop, it fails because you’re only providing the location on the remote host: /images/adapt-icon-search.png?1342791397. With no scheme or host in front of it, urllib treats that string as a local file path, which is why the traceback ends in open_local_file.
You need to extend your code to detect those kinds of URLs (which are perfectly legal, and in fact, very common). For the kind you’re hitting here, you just need to prepend the host name of the server (e.g. http://dribbble.com) to the matched URL.
You will probably also want to handle relative URLs, which also exclude the hostname but do not start with a / character. Those need to be prefixed with the directory of the page they appeared on, if there was one. So if you were scraping http://dribbble.com/foo/bar.html, you’d need to prepend http://dribbble.com/foo/ to a relative URL.
There’s likely a library that will automate handling of non-absolute URLs for you, perhaps as part of the web-scraping process. I’m afraid I don’t know much about web scraping first hand, but perhaps somebody else can suggest one in a comment.
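For what it’s worth, the standard library’s urljoin function appears to cover both of the cases described above, so here is a minimal sketch of how the matched URLs could be made absolute before downloading. It is written for Python 3, where urljoin lives in urllib.parse; on Python 2.7 the same function is in the urlparse module. The helper name absolutize and the path icons/star.png are made up for illustration, and the page URL is the hypothetical one from the example above.

```python
# Python 3; on Python 2.7 use "from urlparse import urljoin" instead.
from urllib.parse import urljoin


def absolutize(page_url, matches):
    """Resolve each matched src value against the URL of the page it came from.

    absolutize is a hypothetical helper name, not part of any library.
    """
    return [urljoin(page_url, match) for match in matches]


if __name__ == "__main__":
    # Hypothetical page URL, taken from the example in the answer text.
    page_url = "http://dribbble.com/foo/bar.html"
    matches = [
        # Starts with "/": resolved against the host only.
        "/images/adapt-icon-search.png?1342791397",
        # Relative path (made up for illustration): resolved against /foo/.
        "icons/star.png",
        # Already absolute: returned unchanged.
        "http://dribbble.s3.amazonaws.com/users/107759/avatars/original/avatar-new.png?1339961321",
    ]
    for resolved in absolutize(page_url, matches):
        print(resolved)
```

In the original script, that would mean passing each resolved URL to urlretrieve instead of the raw regex match, with fullurl serving as the page URL.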