I need to calculate standard deviation and other stats on a large multidimensional ndarray of gridded point data. Example:
import numpy as np
# ... gridded data are read into g1, g2, g3 arrays ...
allg = np.array([g1, g2, g3])
allmg = np.ma.masked_values(allg, -99.)
sd = np.zeros((3, 3315, 8325))
np.std(allmg, axis=0, ddof=1, out=sd)
Various websites show the performance advantages of wrapping NumPy calculations in numexpr.evaluate(), but I don’t think there’s a way to run np.std() inside numexpr.evaluate() (correct me if I’m wrong). Are there any other ways I can optimize the np.std() call? It currently takes about 18 seconds on my system, and I’m hoping to make it much faster.
You could use multiprocessing to split the calculation across several processes. But before trying that, try rearranging your data so that std() is computed along the last axis. Here is an example:
The result on my PC is:
Since the data along the last axis are contiguous in memory, data access makes better use of the CPU cache.
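As a rough sketch of that rearrangement (not the code from the original answer), the stacking axis can be moved to the end and copied into contiguous memory before the masked std() is taken along it. The helper name `std_over_first_axis` and the small random test array are assumptions for illustration only; the -99. fill value and ddof=1 come from the question.

```python
import numpy as np


def std_over_first_axis(stack, fill_value=-99.0, ddof=1):
    """Masked std across axis 0, computed along a contiguous last axis.

    Hypothetical helper for illustration; not part of NumPy.
    """
    # Move the stacking axis to the end, then copy so that the values
    # reduced together sit next to each other in memory.
    rearranged = np.ascontiguousarray(np.moveaxis(stack, 0, -1))
    masked = np.ma.masked_values(rearranged, fill_value)
    return masked.std(axis=-1, ddof=ddof)


# Small stand-in for the real (3, 3315, 8325) grids (assumed data).
rng = np.random.default_rng(0)
small = rng.random((3, 40, 50))
small[0, 0, 0] = -99.0  # one missing value, flagged as in the question

direct = np.ma.masked_values(small, -99.0).std(axis=0, ddof=1)
rearranged_result = std_over_first_axis(small)
```

Both calls should give the same values. Whether the rearranged version is actually faster depends on the array sizes and on the cost of the extra copy, so it is worth timing both on the real data.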