I am using the Python sklearn libraries.
I have 150,000+ sentences.
I need an array-like object, where each row corresponds to a sentence, each column corresponds to a word, and each element is the number of times that word occurs in that sentence.
For example: If the two sentences were “The dog ran” and “The boy ran”, I need
[ [1, 1, 1, 0]
, [0, 1, 1, 1] ]
(the order of the columns is irrelevant, and depends on which column is assigned to which word)
My array will be sparse (each sentence will have a fraction of the possible words), and so I am using scipy.sparse.
import re
import scipy.sparse as sp

def word_counts(texts, word_map):
    w_counts = sp.???_matrix((len(texts), len(word_map)))
    for n in range(len(texts)):
        for word in re.findall(r"[\w']+", texts[n]):
            index = word_map.get(word)
            if index is not None:
                w_counts[n, index] += 1
    return w_counts
...
nb = MultinomialNB() #from sklearn
words = features.word_list(texts)
nb.fit(features.word_counts(texts,words), classes)
I want to know what sparse matrix would be best.
I tried using coo_matrix but got an error:
TypeError: 'coo_matrix' object has no attribute '__getitem__'
I looked at the documentation for COO but was very confused by the following:
Sparse matrices can be used in arithmetic operations …
Disadvantages of the COO format …
does not directly support:
arithmetic operations
I used dok_matrix, and that worked, but I don’t know if this performs best in this case.
Thanks in advance.
Try either lil_matrix or dok_matrix; those are easy to construct and inspect (but in the case of lil_matrix, potentially very slow as each insertion takes linear time). Scikit-learn estimators that accept sparse matrices will accept any format and convert them to an efficient format internally (usually csr_matrix). You can also do the conversion yourself using the methods tocoo, todok, tocsr etc. on scipy.sparse matrices.
Or, just use the CountVectorizer or DictVectorizer classes that scikit-learn provides for exactly this purpose. CountVectorizer takes entire documents as input, while DictVectorizer assumes you've already done tokenization and counting, with the result of that in a dict per sample.
(The min_df argument to CountVectorizer was added a few releases ago. If you're using an old version, omit it, or rather, upgrade.)
EDIT: According to the FAQ, I must disclose my affiliation, so here goes: I'm the author of DictVectorizer and I also wrote parts of CountVectorizer.
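The code samples that originally accompanied this answer did not survive, so here is a minimal sketch of the three approaches, applied to the two sentences from the question. It is a reconstruction under assumptions, not the original code: the variable names are made up for illustration, min_df=1 is chosen so that no word is dropped, and the tokenizing regex for the manual and DictVectorizer variants is borrowed from the question and lowercased to match CountVectorizer's default behaviour.

```python
import re
from collections import Counter

from scipy.sparse import dok_matrix
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

texts = ["The dog ran", "The boy ran"]

# Option 1: CountVectorizer tokenizes and counts the raw documents itself.
# min_df=1 keeps every word that appears in at least one document.
count_vec = CountVectorizer(min_df=1)
X_counts = count_vec.fit_transform(texts)  # scipy.sparse matrix
print(X_counts.toarray().tolist())  # [[0, 1, 1, 1], [1, 0, 1, 1]]

# Option 2: DictVectorizer expects one {token: count} mapping per sample,
# so tokenization and counting are done beforehand.
token_counts = [Counter(re.findall(r"[\w']+", t.lower())) for t in texts]
dict_vec = DictVectorizer()
X_dicts = dict_vec.fit_transform(token_counts)  # sparse, float-valued

# Option 3: fill a dok_matrix by hand, then convert it to CSR.
# The word-to-column map is reused from CountVectorizer for comparison.
word_map = count_vec.vocabulary_
X_manual = dok_matrix((len(texts), len(word_map)), dtype=int)
for row, text in enumerate(texts):
    for word in re.findall(r"[\w']+", text.lower()):
        col = word_map.get(word)
        if col is not None:
            X_manual[row, col] += 1
X_manual = X_manual.tocsr()
```

All three matrices hold the same counts here, with columns in alphabetical order (boy, dog, ran, the), and any of them can be passed to MultinomialNB.fit along with the class labels.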