I have a working Python script that checks the 6,300 or so sites we have to ensure they are up by sending an HTTP request to each and measuring the response. Currently the script takes about 40 min to run completely, I was interested in possibly some other ways to speed up the script, two thoughts were either threading or multiple running instances.
This is the order of execution now:
- MySQL query to get all of the active domains to scan (6,300 give or take)
- Iterate through each domain and using urllib send an HTTP request to each
- If the site doesn’t return ‘200’ then log the results
- repeat until complete
This seems like it could possibly be sped up significantly with threading but I am not quite sure how that process flow would look since I am not familiar with threading.
If someone could offer a sample high-level process flow and any other pointers for working with threading or offer any other insights on how to improve the script in general it would be appreciated.
The flow would look something like this:
You’ll probably want to tune the number of threads, thus the pool, and not just 6300 threads for every domain.