We’ve got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects — this takes about 20 seconds. We’re using Python 2.6 on 64-bit Windows machines.
The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.
Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume using another unpickling technique such as JSON wouldn’t help here.
I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?
Try the marshal module – it’s internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn’t serialize arbitrary instances like pickle, only builtin types (don’t remember the exact constraints, see docs). Also note that the format isn’t stable.
If you need to initialize multiple processes and can tolerate one process always loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy on write) and shares the memory between all processes. [Disclaimers: untested; unlike Ruby, Python ref counting will trigger page copies so this is probably useless if you have huge objects and/or access a small fraction of them.]
If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.
If you’ll only use a small part of the objects, then an OO database (like Zope’s) can probably help you. Though if you need them all in memory, you will just waste lots of overhead for little gain. (never used one, so this might be nonsense).
Maybe other python implementations can do it? Don’t know, just a thought…