I have two 2D arrays whose column vectors are feature vectors. One array is of size F x A, the other of size F x B, where A << B. As an example, for A = 2 and F = 3 (B can be anything):
arr1 = np.array( [[1, 4],
[2, 5],
[3, 6]] )
arr2 = np.array( [[1, 4, 7, 10, ..],
[2, 5, 8, 11, ..],
[3, 6, 9, 12, ..]] )
I want to calculate the distance between arr1 and a fragment of arr2 that is of equal size (in this case, 3×2), for each possible fragment of arr2. The column vectors are independent of each other, so I believe I should calculate the distance between each column vector in arr1 and a collection of column vectors ranging from i to i + A from arr2 and take the sum of these distances (not sure though).
Does numpy offer an efficient way of doing this, or will I have to take slices from the second array and, using another loop, calculate the distance between each column vector in arr1 and the corresponding column vector in the slice?
Example for clarity, using the arrays stated above:
>>> magical_distance_func(arr1, arr2[:,:3])
[0, 10.3923..]
>>> # First, distance between arr2[:,:2] and arr1, which equals 0.
>>> # Second, distance between arr2[:,1:3] and arr1, which equals
>>> diff = arr1 - np.array( [[4,7],[5,8],[6,9]] )
>>> diff
[[-3, -3], [-3, -3], [-3, -3]]
>>> # this happens to consist only of -3's. Norm of each column vector is:
>>> norm1 = np.linalg.norm(diff[:,0])
>>> norm2 = np.linalg.norm(diff[:,1])
>>> # would be extremely good if this worked for an arbitrary number of norms
>>> totaldist = norm1 + norm2
>>> totaldist
10.3923...
Of course, transposing the arrays is fine too, if that means that cdist can somehow be used here.
If I understand your question correctly, this will work. Knowing numpy, there’s probably a better way, but this is at least fairly straightforward. I used some contrived coordinates to show that the calculation is working as expected.
You can subtract arr1 from arr2 by ensuring that they broadcast against each other correctly. The best way I could think of involves taking a transpose and doing some reshaping. These don’t create copies; they create views, so this isn’t so wasteful. (dist is a copy though.)
Now all we have to do is apply numpy.linalg.norm across axis 1. (You can select from among several norms.)
Assuming you want simple euclidean distance, you can also do it directly; not sure whether this will be faster or slower so try both:
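The steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not necessarily the exact code the answer originally showed: the sample data, the (block, feature, column) layout and the names blocks, col_norms and block_dist are all chosen here for illustration, and B is assumed to be a multiple of A.

```python
import numpy as np

# Hypothetical example data: F = 3 features, A = 2 columns, B = 4 columns.
arr1 = np.array([[1., 4.],
                 [2., 5.],
                 [3., 6.]])
arr2 = np.array([[1., 4., 7., 10.],
                 [2., 5., 8., 11.],
                 [3., 6., 9., 12.]])

F, A = arr1.shape
B = arr2.shape[1]
if B % A != 0:
    raise ValueError("this blockwise sketch assumes B is a multiple of A")

# Split arr2 into B // A non-overlapping F x A blocks: shape (B // A, F, A).
blocks = arr2.reshape(F, B // A, A).transpose(1, 0, 2)

# arr1, shape (F, A), broadcasts against every block.
# The subtraction allocates a new array.
dist = blocks - arr1

# Euclidean norm of each column vector: reduce over the feature axis (axis 1).
col_norms = np.linalg.norm(dist, axis=1)        # shape (B // A, A)

# The same quantity computed directly.
col_norms_direct = np.sqrt((dist ** 2).sum(axis=1))

# Sum the per-column distances to get one distance per block.
block_dist = col_norms.sum(axis=1)
```

Timing both variants on realistic array sizes is the only reliable way to tell which one is faster.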
Based on your edit, we have to do only one small tweak. Since you want to test the columns pairwise, rather than blockwise, you need a rolling window. This can be done very simply with fairly straightforward indexing:
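A minimal sketch of such a rolling window built with plain slicing, assuming the same kind of F x B layout as in the question (the sample data and the name windows are illustrative):

```python
import numpy as np

# Hypothetical example data: F = 3 features, B = 4 columns.
arr2 = np.array([[1., 4., 7., 10.],
                 [2., 5., 8., 11.],
                 [3., 6., 9., 12.]])
A = 2  # window width, i.e. the number of columns in arr1

# One F x A slice per starting column i; the stacked result has
# shape (B - A + 1, F, A).
windows = np.array([arr2[:, i:i + A] for i in range(arr2.shape[1] - A + 1)])
```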
Combining that with the other tricks:
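Put together, it might look something like this. The helper name rolling_column_distances is made up for this sketch, and it assumes Euclidean distance with the per-column distances summed, as described in the question:

```python
import numpy as np

def rolling_column_distances(arr1, arr2):
    """Per-column Euclidean distances between arr1 and each window of arr2.

    Hypothetical helper: arr1 is F x A, arr2 is F x B, and the result
    has shape (B - A + 1, A).
    """
    A = arr1.shape[1]
    B = arr2.shape[1]
    # Rolling window via slicing: shape (B - A + 1, F, A).
    windows = np.array([arr2[:, i:i + A] for i in range(B - A + 1)])
    # Broadcast arr1 against every window, then reduce over the feature axis.
    return np.linalg.norm(windows - arr1, axis=1)

arr1 = np.array([[1., 4.],
                 [2., 5.],
                 [3., 6.]])
arr2 = np.array([[1., 4., 7., 10.],
                 [2., 5., 8., 11.],
                 [3., 6., 9., 12.]])

col_dists = rolling_column_distances(arr1, arr2)
total = col_dists.sum(axis=1)  # one summed distance per window
```

For this data the first two entries of total are 0 and about 10.3923, which agrees with the worked example in the question.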
However, converting arrays from list comprehensions tends to be slow. It might be faster to use stride_tricks — here again, see which one suits your purposes best:
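One way this could look with numpy.lib.stride_tricks.as_strided is sketched below. The shape and strides are worked out here for a 2-d arr2 and are not taken from the original answer; as_strided performs no bounds checking, so wrong values can read out of bounds. The view is marked read-only for that reason.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Hypothetical example data: F = 3 features, B = 4 columns.
arr2 = np.array([[1., 4., 7., 10.],
                 [2., 5., 8., 11.],
                 [3., 6., 9., 12.]])
A = 2  # window width

F, B = arr2.shape
row_stride, col_stride = arr2.strides

# windows[i, f, a] == arr2[f, i + a]: stepping to the next window moves one
# column forward, so the first axis reuses the column stride.
windows = as_strided(
    arr2,
    shape=(B - A + 1, F, A),
    strides=(col_stride, row_stride, col_stride),
    writeable=False,  # overlapping views should not be written to
)
```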
This actually manipulates the way numpy moves over a block of memory, allowing a small array to emulate a bigger array.
So now you have a simple 2-d array corresponding to distances for each pair of columns. Now it’s just a matter of getting the mean and calling argmin.
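As a small illustration of that last step, with made-up distances rather than values computed from the arrays above (col_dists has one row per window and one column per column vector):

```python
import numpy as np

# Hypothetical per-column distances for three windows of width A = 2.
col_dists = np.array([[3.0, 1.0],
                      [0.5, 0.5],
                      [2.0, 4.0]])

# Average the column distances within each window ...
mean_dist = col_dists.mean(axis=1)

# ... and pick the window with the smallest average distance.
best_start = int(mean_dist.argmin())
```

Here best_start is 1, so the closest fragment would be arr2[:, 1:1 + A]. Using sum instead of mean gives the same argmin, since every window has the same number of columns.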