I’m working on a bit of python code that uses mechanize to grab data from another website. Because of the complexity of the website the code takes 10-30 seconds to complete. It has to work its way through a couple pages and such.
I plan on having this piece of code being called fairly frequently. I’m wondering the best way to implement something like this without causing a huge server load. Since I’m fairly new to python I’m not sure how the language works.
If the code is in the middle of processing one request and another user calls the code, can two instances of the code run at once? Is there a better way to implement something like this?
I want to design it in a way that it can complete the hefty tasks without being too taxing on the server.
My data grabbing scripts all use caching to minimize load on the server. Be sure to use HTTP tools such as, “If-Modified-Since”, “If-None-Match”, and “Accept-Encoding: gzip”.
Also consider using the multiprocessing module so that you can run requests in parallel.
Here is an excerpt from my downloader script:
The cache is an instance of shelve, a persistent dictionary.
Calls to the downloader are parallelized with imap_unordered() on a multiprocessing Pool instance.