I have a python generator that does work that produces a large amount of data, which uses up a lot of ram. Is there a way of detecting if the processed data has been “consumed” by the code which is using the generator, and if so, pause until it is consumed?
def multi_grab(urls,proxy=None,ref=None,xpath=False,compress=True,delay=10,pool_size=50,retries=1,http_obj=None):
if proxy is not None:
proxy = web.ProxyManager(proxy,delay=delay)
pool_size = len(pool_size.records)
work_pool = pool.Pool(pool_size)
partial_grab = partial(grab,proxy=proxy,post=None,ref=ref,xpath=xpath,compress=compress,include_url=True,retries=retries,http_obj=http_obj)
for result in work_pool.imap_unordered(partial_grab,urls):
if result:
yield result
run from:
if __name__ == '__main__':
links = set(link for link in grab('http://www.reddit.com',xpath=True).xpath('//a/@href') if link.startswith('http') and 'reddit' not in link)
print '%s links' % len(links)
counter = 1
for url, data in multi_grab(links,pool_size=10):
print 'got', url, counter, len(data)
counter += 1
A generator simply yields values. There’s no way for the generator to know what’s being done with them.
But the generator also pauses constantly, as the caller does whatever it does. It doesn’t execute again until the caller invokes it to get the next value. It doesn’t run on a separate thread or anything. It sounds like you have a misconception about how generators work. Can you show some code?