Is the cache in a standard memoize decorator process-safe?
For example, suppose I define the following decorator:
import functools

def memoize(func):
    cache = {}
    @functools.wraps(func)
    def memoized(*args):
        result = None
        if args in cache:
            result = cache[args]
        else:
            result = func(*args)
            cache[args] = result
        return result
    return memoized
and suppose I am trying to use it to speed up computation of a recursive function, say:
@memoize
def fib(n):
    result = 1
    if n > 1:
        result = fib(n-1) + fib(n-2)
    return result
Now I wonder if two processes calculating fib() could ever clash? For example:
if __name__ == "__main__":
    from multiprocessing import Process
    p1 = Process(target=fib, args=(19,))
    p2 = Process(target=fib, args=(23,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
My first thought was that the cache is saved in the context of fib, so it is
shared between the processes and that could lead to race conditions. But then,
I think that the worst that could happen is that they would both think that, say,
fib(17) has not been calculated, and will both go ahead and calculate it in
parallel and store the same result one after the other. That is not ideal,
but not horrible, I guess. But I still wonder if there is a way to do it in a process-safe way.
EDIT: I added a print statement in each of the branches of memoized(),
and it seems that each process re-calculates all the fib values in the cache.
Perhaps the cache is not shared, after all? If it is not shared, I wonder
if there is a process-safe way to share it (to save some more computations).
By default, multiprocess programs in Python share very little between processes. The few things that are shared are pickled, which comes with some limitations of its own. The fib function in your example is nominally shared, but pickle stores functions by name, not by value. That is why its cache doesn’t get shared.
If you want to have a synchronized cache for your memoize decorator, you’ll need to add synchronization to it, such as a multiprocessing.Queue or multiprocessing.Array. This may be slower than simply letting each process recalculate the values though, since it introduces a lot of overhead as the processes pass the updates back and forth.
Alternatively, if you don’t need your separate processes to be tightly synchronized while they’re running, you could come up with a method of passing the cache to and from the processes when they start and stop (e.g. using an extra argument and return value), so that sequential calls could benefit from the memoization, even if parallel calls do not.
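As a rough illustration of that last idea, here is a minimal sketch of passing the cache in as an extra argument and getting it back as part of the return value. The names memoize_with and fib_worker are hypothetical helpers made up for this example, not part of any library, and the use of multiprocessing.Pool (rather than Process, as in the question) is my own assumption to make collecting return values easy.

```python
import functools
from multiprocessing import Pool


def memoize_with(cache):
    """Variant of memoize that stores results in a caller-supplied dict (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def memoized(*args):
            if args not in cache:
                cache[args] = func(*args)
            return cache[args]
        return memoized
    return decorator


def fib_worker(task):
    """Compute fib(n) starting from seed_cache; return the result and the filled cache."""
    n, seed_cache = task
    cache = dict(seed_cache)  # private copy, so the caller's dict is never mutated

    @memoize_with(cache)
    def fib(k):
        return 1 if k <= 1 else fib(k - 1) + fib(k - 2)

    return fib(n), cache


if __name__ == "__main__":
    merged = {}
    with Pool(2) as pool:
        # Parallel calls: each child starts from the same (empty) cache,
        # so they still duplicate work between themselves.
        for _, cache in pool.map(fib_worker, [(19, merged), (23, merged)]):
            merged.update(cache)
        # A later call starts from the merged cache and only computes new values.
        result, merged = pool.map(fib_worker, [(30, merged)])[0]
    print(result, len(merged))
```

The cache is pickled on the way in and on the way out, so this only pays off when the function being memoized is much more expensive than copying the dictionary. For something as cheap as fib, letting each process recalculate is probably faster.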