How can I modify my script to skip a URL if the connection times out or is invalid/404?
Python
#!/usr/bin/python
#parser.py: Downloads Bibles and parses all data within <article> tags.
__author__ = "Cody Bouche"
__copyright__ = "Copyright 2012 Digital Bible Society"
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
url = link.get('href')
name = urlparse.urlparse(url).path.split('/')[-1]
dirname = urlparse.urlparse(url).path.split('.')[-1]
f = urllib2.urlopen(url)
s = f.read()
if (os.path.isdir(dirname) == 0):
os.mkdir(dirname)
soup = BeautifulSoup(s)
articleTag = soup.html.body.article
converted = str(articleTag)
full_path = os.path.join(dirname, name)
open(full_path, 'wb').write(converted)
print(name)
print("DOWNLOADS COMPLETE!")
To apply the timeout to your request add the
timeoutvariable to your call tourlopen. From the docs:Refer to this guide’s section on how to handle exceptions with urllib2. Actually I found the whole guide very useful.
The
request timeoutexception code is408. Wrapping it up, if you were to handle timeout exceptions you would: