I have an iterator which is supposed to run for several days. I want errors to be caught and reported, and then I want the iterator to continue. Or the whole process can start over.
Here’s the function:
def get_units(self, scraper):
    units = scraper.get_units()
    i = 0
    while True:
        try:
            unit = units.next()
        except StopIteration:
            if i == 0:
                log.error("Scraper returned 0 units", {'scraper': scraper})
            break
        except:
            traceback.print_exc()
            log.warning("Exception occurred in get_units", extra={'scraper': scraper, 'iteration': i})
        else:
            yield unit
        i += 1
Because scraper could be one of many variants of code, it can’t be trusted and I don’t want to handle the errors there.
But when an error occurs in units.next(), the whole thing stops. I suspect this is because an iterator throws a StopIteration when one of its iterations fails.
Here’s the output (only the last lines):
[2012-11-29 14:11:12 /home/amcat/amcat/scraping/scraper.py:135 DEBUG] Scraping unit <Element div at 0x4258c710>
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article Counter-Strike: Global Offensive Update Released
Traceback (most recent call last):
  File "/home/amcat/amcat/scraping/controller.py", line 101, in get_units
    unit = units.next()
  File "/home/amcat/amcat/scraping/scraper.py", line 114, in get_units
    for unit in self._get_units():
  File "/home/amcat/scraping/games/steamcommunity.py", line 90, in _get_units
    app_doc = self.getdoc(url,urlencode(form))
  File "/home/amcat/amcat/scraping/scraper.py", line 231, in getdoc
    return self.opener.getdoc(url, encoding)
  File "/home/amcat/amcat/scraping/htmltools.py", line 54, in getdoc
    response = self.opener.open(url, encoding)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
[2012-11-29 14:11:14 /home/amcat/amcat/scraping/controller.py:110 WARNING] Exception occurred in get_units
...code ends...
So how can I prevent the iteration from stopping when an error occurs?
EDIT: here’s the code within get_units()
def get_units(self):
    """
    Split the scraping job into a number of 'units' that can be processed independently
    of each other.
    @return: a sequence of arbitrary objects to be passed to scrape_unit
    """
    self._initialize()
    for unit in self._get_units():
        yield unit
And here’s a simplified _get_units():
INDEX_URL = "http://www.steamcommunity.com"
def _get_units(self):
    doc = self.getdoc(INDEX_URL)  # returns an lxml.etree document
    for a in doc.cssselect("div.discussion a"):
        link = a.get('href')
        yield link
EDIT: question followup: Alter each for-loop in a function to have error handling executed automatically after each failed iteration
StopIteration is raised by the next() method of a generator when there are no more items. It has nothing to do with errors inside the generator/iterator.
Another thing to note is that, depending on the type of your iterator, it might not be able to resume after an exception. If the iterator is an object with its own next method, it can resume. However, if it’s actually a generator, it can’t.
As far as I can tell, this is the only reason why your iteration doesn’t continue after an error from units.next(). I.e. units.next() fails, and the next time you call it, it’s not able to resume and it says it’s done by throwing a StopIteration exception.
Basically you’d have to show us the code inside scraper.get_units() for us to understand why the loop is not able to continue after an error inside a single iteration. If get_units() is implemented as a generator function, it’s clear. If not, it might be something else that’s preventing it from resuming.
UPDATE: explaining what a generator function is:
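As an illustration (a minimal sketch, not the actual scraper code: this Scraper class and its hard-coded unit names are made up), a generator function is any function whose body contains yield. The built-in next(units) used here does the same thing as units.next() in Python 2.6 and later:

```python
class Scraper(object):
    """Hypothetical stand-in for a real scraper."""

    def get_units(self):
        # The presence of `yield` makes this a generator function:
        # calling it runs none of the body, it only builds a generator object.
        for name in ["unit-a", "unit-b", "unit-c"]:
            yield name


units = Scraper().get_units()  # nothing in get_units has executed yet
first = next(units)            # runs the body up to the first yield
second = next(units)           # resumes after that yield, up to the next one
```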
Now, when you call Scraper().get_units(), instead of running the entire function, it returns a generator object. Calling next() on it will take the execution to the first yield, and so on. Now if an error occurs ANYWHERE inside get_units, it will be tainted, so to say, and the next time you call next(), it will raise StopIteration, just as if it had run out of items to give you.
Reading http://www.dabeaz.com/generators/ (and http://www.dabeaz.com/coroutines/) is strongly recommended.
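The difference between a generator and an object with its own next method can be checked with a small experiment. This is only a sketch: flaky_generator, FlakyIterator and drain are hypothetical names, and the failure is simulated with a ValueError on the second item.

```python
def flaky_generator():
    # Hypothetical generator: the second step raises instead of yielding.
    yield 1
    raise ValueError("simulated failure")
    yield 3  # never reached: the generator is finished once it raises


class FlakyIterator(object):
    """Hypothetical class-based iterator that keeps its own state."""

    def __init__(self):
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.i += 1
        if self.i > 3:
            raise StopIteration
        if self.i == 2:
            raise ValueError("simulated failure")
        return self.i

    next = __next__  # Python 2 spelling of the same method


def drain(iterator):
    """Collect items, counting errors instead of stopping on them."""
    items, errors = [], 0
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            break
        except ValueError:
            errors += 1
        else:
            items.append(item)
    return items, errors


gen_result = drain(flaky_generator())  # the generator is done after the error
obj_result = drain(FlakyIterator())    # the object carries on with item 3
```

With the same consuming loop, the generator delivers only the item before the error, while the class-based iterator also delivers the item after it, because its state lives in self.i rather than in a suspended function frame.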
UPDATE2: A possible solution: https://gist.github.com/4175802
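The gist itself is not reproduced here. As a separate sketch of one way around the problem, the risky call can be wrapped in try/except inside the generator, once per item, so the exception never escapes the generator and it stays resumable. The fetch function and the URL strings below are hypothetical stand-ins for self.getdoc and the real links:

```python
import logging

log = logging.getLogger(__name__)


def fetch(url):
    # Hypothetical stand-in for self.getdoc(url); fails for one URL.
    if "bad" in url:
        raise IOError("HTTP Error 500: Internal Server Error")
    return "document for " + url


def get_units(urls):
    """Yield one document per URL, logging and skipping failures."""
    for url in urls:
        try:
            doc = fetch(url)
        except Exception:
            # Handled inside the generator, so it is not terminated
            # and the loop moves on to the next URL.
            log.exception("Skipping %s", url)
            continue
        yield doc


units = list(get_units(["a", "bad", "c"]))
```

This does mean the error handling lives on the scraper side. If the scraper code can’t be changed, the remaining option is to call scraper.get_units() again to obtain a fresh generator, which starts over from the beginning.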