I’m trying to cluster some documents according to a tf-idf matrix using Python.
First I follow the Wikipedia definition of the formula, using normalised tf: http://en.wikipedia.org/wiki/Tf-idf
feat_vectors starts as a two-dimensional numpy array, with rows representing documents and columns representing terms; each cell holds the number of occurrences of that term in that document.
import numpy as np
feat_vectors /= np.max(feat_vectors,axis=1)[:,np.newaxis]
idf = len(feat_vectors) / (feat_vectors != 0).sum(0)
idf = np.log(idf)
feat_vectors *= idf
I then cluster these vectors using scipy:
from scipy.cluster import hierarchy
clusters = hierarchy.linkage(feat_vectors,method='complete',metric='cosine')
flat_clusters = hierarchy.fcluster(clusters, 0.8,'inconsistent')
However, on that last line it throws an error:
ValueError: Linkage 'Z' contains negative distances.
Cosine similarity ranges from -1 to 1. However, the Wikipedia page for cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity) states:
In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative.
So if I am getting a negative similarity, it seems that I am making some error in calculating tf-idf. Any ideas what my mistake is?
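As a quick sanity check of the quoted claim, here is a minimal sketch that computes pairwise cosine similarities for non-negative vectors with plain numpy. The helper name cosine_similarity_matrix and the sample values are made up for illustration; it assumes no row is all zeros, since a zero vector has no defined cosine similarity.

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1)        # assumes no all-zero rows
    return (x @ x.T) / np.outer(norms, norms)

# Made-up non-negative weights: rows are documents, columns are terms.
vectors = np.array([[0.4, 0.0, 0.2],
                    [0.0, 0.4, 0.2],
                    [0.2, 0.1, 0.0]])
sims = cosine_similarity_matrix(vectors)
print(sims.min(), sims.max())
```

With non-negative inputs every dot product is non-negative, so the similarities stay between 0 and 1 up to floating-point rounding. Running the same check on the real feat_vectors (and looking for NaN or negative entries) may help narrow down where things go wrong.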
I suspect the error is in the following line:
idf = len(feat_vectors) / (feat_vectors != 0).sum(0)
Since the logical array is converted to integers by the sum, and len also returns an int, the division is done on integers and you lose precision. Replacing with:
worked for me (i.e. produced what I was expecting with dummy data). Everything else looks correct.
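A minimal sketch of one way to keep the whole computation in floating point, not necessarily the exact replacement used above. The function name tfidf and the dummy counts are made up for illustration, and the sketch assumes every document contains at least one term and every term occurs in at least one document (otherwise there is a division by zero).

```python
import numpy as np

def tfidf(counts):
    """Max-normalised tf times log idf, in float arithmetic (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)          # avoid integer division
    tf = counts / counts.max(axis=1)[:, np.newaxis]   # normalise by each row's max count
    doc_freq = (counts != 0).sum(axis=0)              # documents containing each term
    idf = np.log(float(len(counts)) / doc_freq)       # float numerator forces true division
    return tf * idf

# Made-up dummy data: 3 documents, 3 terms, each term appears in 2 documents.
counts = np.array([[2, 0, 1],
                   [0, 3, 1],
                   [1, 1, 0]])
weights = tfidf(counts)
print(weights)
```

Converting the counts to float up front also protects the in-place normalisation step, which would otherwise operate on an integer array.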