scipy.spatial.distance.pdist returns a condensed distance matrix. From the documentation:
Returns a condensed distance matrix Y. For each i and j (where i < j), the metric dist(u=X[i], v=X[j]) is computed and stored in entry ij.
I thought ij meant i*j. But I think I might be wrong. Consider
X = array([[1,2], [1,2], [3,4]])
dist_matrix = pdist(X)
then the documentation says that dist(X[0], X[2]) should be dist_matrix[0*2]. However, dist_matrix[0*2] is 0 — not 2.8 as it should be.
What’s the formula I should use to access the similarity of two vectors, given i and j?
You can look at it this way: Suppose x is m by n. The possible pairs of the m rows, chosen two at a time, are given by itertools.combinations(range(m), 2); e.g., for m=3 they are (0, 1), (0, 2), (1, 2).
So if d = pdist(x), the kth tuple in combinations(range(m), 2) gives the indices of the rows of x associated with d[k]. Example:
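The correspondence can be checked with a small sketch. The sample points below are made up for illustration, and Euclidean distance (pdist's default metric) is assumed:

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical sample data: three 2-D points, so m = 3 rows.
x = np.array([[0, 10], [10, 10], [20, 20]])
m = x.shape[0]

d = pdist(x)  # condensed distances, length m*(m-1)/2
pairs = list(combinations(range(m), 2))  # [(0, 1), (0, 2), (1, 2)]

# d[k] is the distance between rows pairs[k][0] and pairs[k][1].
for k, (i, j) in enumerate(pairs):
    assert np.isclose(d[k], np.linalg.norm(x[i] - x[j]))
    print(k, (i, j), d[k])
```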
The first element is dist(x[0], x[1]), the second is dist(x[0], x[2]) and the third is dist(x[1], x[2]).
Or you can view it as the elements in the upper triangular part of the square distance matrix, strung together into a 1D array.
E.g.
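A minimal sketch of the upper-triangle view, using the X from the question. squareform expands the condensed array back into the full square matrix. condensed_index is a hypothetical helper, not part of SciPy, that applies the closed-form position m*i + j - ((i + 2)*(i + 1))//2 for i < j:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def condensed_index(i, j, m):
    """Hypothetical helper: position of the pair (i, j) in pdist's output."""
    if i == j:
        raise ValueError("the condensed matrix has no diagonal entries")
    if i > j:
        i, j = j, i  # distances are symmetric, so order the pair
    return m * i + j - ((i + 2) * (i + 1)) // 2


X = np.array([[1, 2], [1, 2], [3, 4]])  # data from the question
m = X.shape[0]
d = pdist(X)
square = squareform(d)  # full symmetric m-by-m matrix, zeros on the diagonal

# The condensed array is the part above the diagonal, read row by row.
iu = np.triu_indices(m, k=1)
assert np.allclose(square[iu], d)

k = condensed_index(0, 2, m)
print(k, d[k], square[0, 2])
```

So for the question's data, dist(X[0], X[2]) sits at position 1 of the condensed array, while position 0 holds dist(X[0], X[1]), which is 0 because those two rows are identical.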