I’m trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:
- Compute vector of inner products between doc
idxand all other documents - Sort in descending order
- Return the “scores” and indices from the second one to the end (i.e. not itself)
The code I have at the moment is:
import h5py
import numpy as np
def get_related(tfidf, idx) :
''' return the top documents '''
# calculate inner product
v = np.inner(tfidf, tfidf[idx].transpose())
# sort
vs = np.sort(v.toarray(), axis=0)[::-1]
scores = vs[1:,]
# sort indices
vi = np.argsort(v.toarray(), axis=0)[::-1]
idxs = vi[1:,]
return (scores, idxs)
where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.
This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results have to then be reversed.
- Can this be done more efficiently?
- Can this be done without converting the sparse matrix using
toarray()?
I don’t think there’s any real need to skip the
toarray. Thevarray will be onlyn_docslong, which is dwarfed by the size of then_docs×n_termstf-idf matrix in practical situations. Also, it will be quite dense since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you’re storing is very sparse (I’ve seen >80% figures for Matlab and assume that Scipy will be similar, though I don’t have an exact figure).The double sort can be skipped by doing
Btw., your use of
np.inneron sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product of two sparse matrices is