[An earlier version of this post got absolutely no response, so, in case this was due to a lack of clarity, I’ve reworked it, with additional explanations and code comments.]
I want to compute a mean and standard deviation over elements of a numpy n-dimensional array that do not correspond to a single axis (but rather to k > 1 non-consecutive axes), and get the results collected in a new (n – k + 1)-dimensional array.
Does numpy include standard constructs to perform this operation efficiently?
The function mu_sigma copied below is my best attempt at solving this problem, but it has two glaring inefficiencies: 1) it requires making a copy of the original data; 2) it computes the mean twice (since the computation of the standard deviation requires computing the mean).
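As a rough illustration of the second inefficiency, here is a minimal sketch that computes the mean once and reuses it for the standard deviation. The helper name mean_and_std_last_axis is hypothetical (it is not part of numpy), and the sketch assumes the population standard deviation (ddof=0, the default of ndarray.std). It still allocates a temporary array for the deviations, so it trades one pass over the data for extra memory.

```python
import numpy as np

def mean_and_std_last_axis(a):
    """Hypothetical helper: (mean, std) over the last axis, mean computed once."""
    mean = a.mean(axis=-1)
    # Deviations from the already-computed mean (this allocates a temporary).
    dev = a - mean[..., np.newaxis]
    # Population standard deviation, i.e. ddof=0 as in ndarray.std's default.
    std = np.sqrt((dev * dev).mean(axis=-1))
    return mean, std

a = np.arange(24, dtype=float).reshape(2, 3, 4)
mean, std = mean_and_std_last_axis(a)
```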
The mu_sigma function takes two arguments: box and axes. box is an n-dimensional numpy array (aka “ndarray”), and axes is a k-tuple of integers, representing (not necessarily consecutive) dimensions of box. The function returns a new (n – k + 1)-dimensional ndarray containing the mean and standard deviation computed over the “hyperslabs” of box represented by the k specified axes.
The code below also includes an example of mu_sigma in action. In this example, the box argument is a 4 x 2 x 4 x 3 x 4 ndarray of floating-point numbers, and the axes argument is the tuple (1, 3). (Hence, we have n == len(box.shape) == 5, and k == len(axes) == 2.) The result (which here I’ll call outbox) returned for this sample input is a 4 x 4 x 4 x 2 ndarray of floating-point numbers. For each triplet of indices i, j, k (where each index ranges over the set {0, 1, 2, 3}), the element outbox[i, j, k, 0] is the mean of the 6 elements specified by the numpy expression box[i, 0:2, j, 0:3, k]. Similarly, outbox[i, j, k, 1] is the standard deviation of the same 6 elements. This means that the first n – k == 3 dimensions of the result range over the same indices as do the n – k non-axes dimensions of the input ndarray box, which in this case are dimensions 0, 2 and 4.
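To make the index correspondence concrete, here is a slow reference computation written as explicit loops. It is only a sketch for checking a faster implementation against, not a proposed solution; the names box and expected are just for illustration.

```python
import numpy as np

# Same shape as the sample input: 4 x 2 x 4 x 3 x 4, filled with 0.0, 1.0, 2.0, ...
box = np.arange(4 * 2 * 4 * 3 * 4, dtype=float).reshape(4, 2, 4, 3, 4)

expected = np.empty((4, 4, 4, 2))
for i in range(4):
    for j in range(4):
        for k in range(4):
            slab = box[i, 0:2, j, 0:3, k]      # 2 x 3 = 6 elements
            expected[i, j, k, 0] = slab.mean()
            expected[i, j, k, 1] = slab.std()  # population std (ddof=0)
```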
The strategy used in mu_sigma is to
1. permute the dimensions (using the transpose method) so that the axes specified in the function’s second argument are all put at the end; the remaining (non-axes) dimensions are left at the beginning (in their original ordering);
2. collapse the axes dimensions into one (by using the reshape method); the new “collapsed” dimension is now the last dimension of the reshaped ndarray;
3. compute an ndarray of the means using the last “collapsed” dimension as axis;
4. compute an ndarray of the standard deviations using the last “collapsed” dimension as axis;
5. return an ndarray obtained from concatenating the ndarrays produced in (3) and (4).
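The permute and collapse steps can be looked at in isolation with the sample shape. This is a small sketch, assuming a numpy recent enough to provide np.shares_memory; it shows that the transpose is only a view, while the reshape that follows has to copy the data because the permuted trailing axes are no longer laid out contiguously.

```python
import numpy as np

box = np.zeros((4, 2, 4, 3, 4))

# Permute: move axes 1 and 3 to the end. This is a view, no data is copied.
permuted = box.transpose((0, 2, 4, 1, 3))
# Collapse: merge the two trailing axes. The strides no longer allow a view,
# so numpy silently makes a copy here.
reshaped = permuted.reshape((4, 4, 4, -1))

print(permuted.shape)                   # (4, 4, 4, 2, 3)
print(reshaped.shape)                   # (4, 4, 4, 6)
print(np.shares_memory(box, permuted))  # True
print(np.shares_memory(box, reshaped))  # False
```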
import numpy as np

def mu_sigma(box, axes):
    inshape = box.shape
    # determine the permutation needed to put all the dimensions given in axes
    # at the end (otherwise preserving the relative ordering of the dimensions)
    nonaxes = tuple([i for i in range(len(inshape)) if i not in set(axes)])
    # permute the dimensions
    permuted = box.transpose(nonaxes + axes)
    # determine the shape of the ndarray after permuting the dimensions and
    # collapsing the axes-dimensions; thanks to Bago for the "+ (-1,)"
    newshape = tuple(inshape[i] for i in nonaxes) + (-1,)
    # collapse the axes-dimensions
    # NB: the next line results in copying the input array
    reshaped = permuted.reshape(newshape)
    # determine the shape for the mean and std ndarrays, as required by
    # the subsequent call to np.concatenate (this reshaping is not necessary
    # if the available mean and std methods support the keepdims keyword;
    # instead, just set keepdims to True in both calls).
    outshape = newshape[:-1] + (1,)
    # compute the means and standard deviations
    mean = reshaped.mean(axis=-1).reshape(outshape)
    std = reshaped.std(axis=-1).reshape(outshape)
    # collect the results in a single ndarray, and return it
    return np.concatenate((mean, std), axis=-1)
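
inshape = 4, 2, 4, 3, 4
# fill the input with 0.0, 1.0, 2.0, ... (np.arange works on both Python 2 and
# Python 3, unlike np.array(map(...)), since map returns an iterator on Python 3)
inbuf = np.arange(np.prod(inshape), dtype=float)
inbox = np.ndarray(inshape, buffer=inbuf)
# axes == (1, 3): every other dimension, starting from dimension 1
outbox = mu_sigma(inbox, tuple(range(len(inshape))[1::2]))

# "inline tests"
# every hyperslab has the same standard deviation
assert all(outbox[..., 1].ravel() ==
           [inbox[0, :, 0, :, 0].std()] * outbox[..., 1].size)
# the means follow a closed-form pattern in the three non-axes indices
assert all(outbox[..., 0].ravel() == [float(4*(v + 3*w) + x)
                                      for v in [8*y - 1
                                                for y in [3*z + 1
                                                          for z in range(4)]]
                                      for w in range(4)
                                      for x in range(4)])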
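It looks like this got a little easier as of numpy 2.0 (at the time of writing, the version number of the development branch). As far as I can tell, the feature was released in numpy 1.7, where mean and std accept a tuple of axes.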
http://projects.scipy.org/numpy/ticket/1234
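A minimal sketch of what that could look like, assuming a numpy version in which mean and std accept a tuple for axis (1.7 or later, as far as I know) and which provides np.stack. The function name mu_sigma_tuple_axes is hypothetical. No explicit transpose or reshape is needed, although the mean is still computed twice internally.

```python
import numpy as np

def mu_sigma_tuple_axes(box, axes):
    """Hypothetical variant of mu_sigma that relies on tuple-valued axis."""
    mean = box.mean(axis=axes)  # reduce over all the given axes at once
    std = box.std(axis=axes)    # population std (ddof=0)
    # Put mean and std side by side along a new last dimension.
    return np.stack((mean, std), axis=-1)

inbox = np.arange(4 * 2 * 4 * 3 * 4, dtype=float).reshape(4, 2, 4, 3, 4)
outbox = mu_sigma_tuple_axes(inbox, (1, 3))
print(outbox.shape)  # (4, 4, 4, 2)
```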