In doing some bioinformatics work, I’ve been pondering the ramifications of storing object instances in a Numpy array rather than a Python list, but in all the testing I’ve done the performance was worse in every instance. I am using CPython. Does anyone know the reason why?
Specifically:
- What are the performance impacts of using a fixed-length array `numpy.ndarray(dtype=object)` vs. a regular Python list? Initial tests I performed showed that accessing the Numpy array elements was slower than iteration through the Python list, especially when using object methods.
- Why is it faster to instantiate objects using a list comprehension such as `[ X() for i in range(n) ]` instead of a `numpy.empty(size=n, dtype=object)`?
- What is the memory overhead of each? I was not able to test this. My classes extensively use `__slots__`, if that has any impact.
Don’t use object arrays in numpy for things like this.
They defeat the basic purpose of a numpy array, and while they’re useful in a tiny handful of situations, they’re almost always a poor choice.
Yes, accessing an individual element of a numpy array in python or iterating through a numpy array in python is slower than the equivalent operation with a `list`. (Which is why you should never do something like `y = [item * 2 for item in x]` when `x` is a numpy array.)
Numpy object arrays will have a slightly lower memory overhead than a list, but if you’re storing that many individual python objects, you’re going to run into other memory problems first.
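To illustrate the iteration point, here is a rough sketch comparing the list comprehension with the vectorized expression. The array size and the `timeit` repeat count are arbitrary choices of mine, and the timings will vary by machine.

```python
import timeit

import numpy as np

x = np.arange(100_000, dtype=float)

# Element by element: each item is boxed into a Python-level scalar object
y_slow = [item * 2 for item in x]

# Vectorized: the loop runs inside numpy's compiled code
y_fast = x * 2

# Both approaches produce the same values
assert np.array_equal(np.array(y_slow), y_fast)

t_slow = timeit.timeit(lambda: [item * 2 for item in x], number=10)
t_fast = timeit.timeit(lambda: x * 2, number=10)
print(f"list comprehension: {t_slow:.4f}s, vectorized: {t_fast:.4f}s")
```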
Numpy is first and foremost a memory-efficient, multidimensional array container for uniform numerical data. If you want to hold arbitrary objects in a numpy array, you probably want a list, instead.
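A small sketch of what "memory-efficient" means here, assuming 64-bit CPython. The element count is arbitrary, and `sys.getsizeof` only gives an approximate per-object figure.

```python
import sys

import numpy as np

n = 1000
values = [float(i) for i in range(n)]

# A float64 array stores the raw 8-byte values contiguously
arr = np.array(values, dtype=np.float64)

# An object array stores one pointer per element; the float objects live elsewhere
obj_arr = np.array(values, dtype=object)

# Rough size of the list: its pointer table plus every float object it refers to
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print("float64 array data:", arr.nbytes, "bytes")
print("object array pointers only:", obj_arr.nbytes, "bytes")
print("list plus its float objects:", list_bytes, "bytes")
```

The object array's `nbytes` counts only the pointers, so the objects themselves still cost the same as they would in a list.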
My point is that if you want to use numpy effectively, you may need to re-think how you’re structuring things.
Instead of storing each object instance in a numpy array, store your numerical data in a numpy array, and if you need separate objects for each row/column/whatever, store an index into that array in each instance.
This way you can operate on the numerical arrays quickly (i.e. using numpy instead of list comprehensions).
As a quick illustration of what I’m talking about, here’s a trivial example without using numpy:
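A minimal sketch of the pure-python approach, where every instance owns its own data. The `Point` class, its attributes, and the distance calculation are hypothetical choices for illustration, and this is not the code behind the timings quoted in this answer.

```python
import math
import random


class Point:
    """Hypothetical example class; each instance owns its own coordinates."""
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def distance_from_origin(self):
        return math.sqrt(self.x ** 2 + self.y ** 2 + self.z ** 2)


random.seed(0)
n = 10_000
points = [Point(random.random(), random.random(), random.random()) for _ in range(n)]

# Every operation over the whole collection is a Python-level loop
distances = [p.distance_from_origin() for p in points]
farthest = max(distances)
```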
And a similar example using numpy arrays:
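A minimal sketch of the index-into-an-array approach, where the numerical data lives in one shared array and each instance only records its row. The `Point` class, the `coords` array, and the distance calculation are hypothetical choices for illustration, and this is not the code behind the timings quoted in this answer.

```python
import numpy as np


class Point:
    """Hypothetical example class; holds an index into a shared array."""
    __slots__ = ("coords", "index")

    def __init__(self, coords, index):
        self.coords = coords  # reference to the shared (n, 3) array
        self.index = index    # row belonging to this point

    @property
    def xyz(self):
        # A view of this point's row, not a copy
        return self.coords[self.index]


rng = np.random.default_rng(0)
n = 10_000
coords = rng.random((n, 3))
points = [Point(coords, i) for i in range(n)]

# Whole-collection operations run on the array, not on the objects
distances = np.sqrt((coords ** 2).sum(axis=1))
farthest = points[int(distances.argmax())]
```

The objects remain available when per-item behaviour is needed, while the bulk arithmetic never touches them.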
There are other ways to do this (you may want to avoid storing a reference to a specific numpy array in each `point`, for example), but I hope it’s a useful example.
Note the difference in speed at which they run. On my machine, it’s a difference of 5 seconds for the numpy version vs 60 seconds for the pure-python version.