I have a complex data structure (user-defined type) on which a large number of independent calculations are performed. The data structure is basically immutable. I say basically, because though the interface looks immutable, internally some lazy-evaluation is going on. Some of the lazily calculated attributes are stored in dictionaries (return values of costly functions by input parameter).
I would like to use Pythons multiprocessing module to parallelize these calculations. There are two questions on my mind.
- How do I best share the data-structure between processes?
- Is there a way to handle the lazy-evaluation problem without using locks (multiple processes write the same value)?
Thanks in advance for any answers, comments or enlightening questions!
Pipelines.
Break your program up so that each calculation is a separate process of the following form.
For testing, you can use it like this.
For reproducing the whole calculation in a single process, you can do this.
You can wrap each transformation with a little bit of file I/O. Pickle works well for this, but other representations (like JSON or YAML) work well, too.
Each processing step becomes an independent OS-level process. They will run concurrently and will — immediately — consume all OS-level resources.
Pipelines.