I’m working on a bit of python code that uses mechanize to grab data

Question

0

Asked: June 8, 20262026-06-08T19:31:23+00:00 2026-06-08T19:31:23+00:00

I’m working on a bit of python code that uses mechanize to grab data

0

I’m working on a bit of python code that uses mechanize to grab data from another website. Because of the complexity of the website the code takes 10-30 seconds to complete. It has to work its way through a couple pages and such.

I plan on having this piece of code being called fairly frequently. I’m wondering the best way to implement something like this without causing a huge server load. Since I’m fairly new to python I’m not sure how the language works.

If the code is in the middle of processing one request and another user calls the code, can two instances of the code run at once? Is there a better way to implement something like this?

I want to design it in a way that it can complete the hefty tasks without being too taxing on the server.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-08T19:31:27+00:00

My data grabbing scripts all use caching to minimize load on the server. Be sure to use HTTP tools such as, “If-Modified-Since”, “If-None-Match”, and “Accept-Encoding: gzip”.

Also consider using the multiprocessing module so that you can run requests in parallel.

Here is an excerpt from my downloader script:

def urlretrieve(url, filename, cache, lock=threading.Lock()):
    'Read contents of an open url, use etags and decompress if needed'    

    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'gzip')
    with lock:
        if ('etag ' + url) in cache:
            request.add_header('If-None-Match', cache['etag ' + url])
        if ('date ' + url) in cache:
            request.add_header('If-Modified-Since', cache['date ' + url])

    try:
        u = urllib2.urlopen(request)
    except urllib2.HTTPError as e:
        return Response(e.code, e.msg, False, False)
    content = u.read()
    u.close()

    compressed = u.info().getheader('Content-Encoding') == 'gzip'
    if compressed:
        content = gzip.GzipFile(fileobj=StringIO.StringIO(content), mode='rb').read()

    written = writefile(filename, content) 

    with lock:
        etag = u.info().getheader('Etag')
        if etag:
            cache['etag ' + url] = etag
        timestamp = u.info().getheader('Date')
        if timestamp:
            cache['date ' + url] = timestamp

    return Response(u.code, u.msg, compressed, written)

The cache is an instance of shelve, a persistent dictionary.

Calls to the downloader are parallelized with imap_unordered() on a multiprocessing Pool instance.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m working on a bit of python code that uses mechanize to grab data

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply