I have a large (over 600,000 records) database as part of a Django app. The app stores information collected from various open data web services. Every so often (maybe once a week or less) I need to check these web services to see if any of the data has been updated.
I’ve written a Python script to do this. It works, but it is very slow, and I often get this error well before it can complete: ConnectionError: [Errno 104] Connection reset by peer
Based on some experiments, I think this process will take several days to complete. Besides optimizing my script, what is the best way to handle this type of long-running Python process?
Have a look at Celery. It should make it easy to assign background jobs to multiple workers (which can also run on different machines). It also lets you requeue jobs if they fail and retry them later.
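To make the "retry later" idea concrete without depending on any particular task queue, here is a minimal plain-Python sketch of retrying a failed call with exponential backoff. It is not Celery's API; fetch_with_retry, fetch, record_id, max_attempts and base_delay are all hypothetical names made up for illustration, and a task queue would normally handle this for you.

```python
import time


def fetch_with_retry(fetch, record_id, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(record_id), retrying on connection errors with exponential backoff.

    `fetch` is a hypothetical callable that contacts the web service for one record.
    The built-in ConnectionError covers ConnectionResetError (errno 104). If the
    error comes from the requests library, catch requests.exceptions.ConnectionError
    instead, since it is a different class.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(record_id)
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: let the caller decide what to do with this record.
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

Backing off between attempts also gives the remote service room to recover, which may help if the resets are caused by sending too many requests too quickly.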
To optimize your script, you should probably look into multiprocessing or an asynchronous library like gevent (especially if your jobs do a lot of I/O, such as calling web services), which lets you handle many simultaneous connections (hundreds or even thousands) concurrently.
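As a rough illustration of overlapping many I/O-bound calls, here is a sketch that uses a thread pool from the standard library's concurrent.futures rather than multiprocessing or gevent; the idea is the same, but it needs nothing installed. The names check_all, check_record and max_workers=20 are assumptions for the example, and check_record stands in for whatever function queries the web service for one record.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_all(record_ids, check_record, max_workers=20):
    """Run check_record(record_id) for every id on a pool of threads.

    Returns (results, failures): results maps id -> return value, and failures
    maps id -> the exception raised, so failed ids can be retried later.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every job up front and remember which future belongs to which id.
        futures = {pool.submit(check_record, rid): rid for rid in record_ids}
        for future in as_completed(futures):
            rid = futures[future]
            try:
                results[rid] = future.result()
            except Exception as exc:
                # One failing record should not abort the whole run.
                failures[rid] = exc
    return results, failures
```

With several hundred thousand records it is probably better to call this on batches of ids (a few thousand at a time, say) than on the whole table at once, so that progress can be saved between batches and a crash does not lose everything.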