How can I modify my script to skip a URL if the connection times

Question

0

Editorial Team

Asked: June 3, 20262026-06-03T20:18:24+00:00 2026-06-03T20:18:24+00:00

How can I modify my script to skip a URL if the connection times

0

How can I modify my script to skip a URL if the connection times out or is invalid/404?

Python

#!/usr/bin/python

#parser.py: Downloads Bibles and parses all data within <article> tags.

__author__      = "Cody Bouche"
__copyright__   = "Copyright 2012 Digital Bible Society"

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-03T20:18:26+00:00

To apply the timeout to your request add the timeout variable to your call to urlopen. From the docs:

The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.

Refer to this guide’s section on how to handle exceptions with urllib2. Actually I found the whole guide very useful.

The request timeout exception code is 408. Wrapping it up, if you were to handle timeout exceptions you would:

try:
    response = urlopen(req, 3) # 3 seconds
except URLError, e:
    if hasattr(e, 'code'):
        if e.code==408:
            print 'Timeout ', e.code
        if e.code==404:
            print 'File Not Found ', e.code
        # etc etc

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

How can I modify my script to skip a URL if the connection times

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply