I’m trying to replicate the example of StratifiedShuffleSplit with X not being an array but a sparse matrix. In the example below, this matrix was created by a DictVectorizer fit to an array of mixed nominal and numerical features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import StratifiedShuffleSplit
X = [{"a":1, "b":"xx"}, {"a":2, "b":"yx"}, {"a":2, "b":"yx"}, {"a":1, "b":"xx"}]
y = ["A", "B", "B", "A"]
X = DictVectorizer().fit_transform(X)
y = LabelEncoder().fit_transform(y)
sss = StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0)
for train_index, test_index in sss:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
When I run the script, the following error is thrown:
Traceback (most recent call last):
  File ".../test.py", line 22, in <module>
    X_train, X_test = X[train_index], X[test_index]
TypeError: only integer arrays with one element can be converted to an index
This is because X is not an array but a sparse matrix. So the question is: how can I split the data using this method when X is a sparse matrix rather than an array? Perhaps the problem is not specific to scikit-learn, but comes from numpy? Do I have to “transform” train_index and test_index before “applying” them to X? Or maybe I have to “transform” X instead?
According to the documentation of StratifiedShuffleSplit, for it to work with matrices, I should pass True to the parameter indices, but it doesn’t help.
Any suggestion you could give me would be more than welcome.
The issue is caused by the fact that, in your version of scikit-learn, DictVectorizer returns a COO matrix, which is not row-wise indexable (the SciPy error message is unfortunately not very explicit). To fix the issue, convert the vectorized output to CSR format by replacing the line:
X = DictVectorizer().fit_transform(X)
by:
X = DictVectorizer().fit_transform(X).tocsr()
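A minimal sketch of the CSR conversion described above, using the data from the question. The names `records` and `X_csr` and the hand-written index arrays are assumptions for illustration only; the index arrays stand in for one train/test split produced by StratifiedShuffleSplit. Calling `.tocsr()` is harmless if the matrix is already in CSR format.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Same toy data as in the question.
records = [{"a": 1, "b": "xx"}, {"a": 2, "b": "yx"},
           {"a": 2, "b": "yx"}, {"a": 1, "b": "xx"}]

# Convert to CSR whatever sparse format fit_transform returned,
# so that rows can be selected with an integer index array.
X_csr = DictVectorizer().fit_transform(records).tocsr()

# Hypothetical index arrays standing in for one split
# yielded by StratifiedShuffleSplit.
train_index = np.array([0, 2])
test_index = np.array([1, 3])

# Row-wise fancy indexing works on a CSR matrix.
X_train, X_test = X_csr[train_index], X_csr[test_index]
print(X_train.toarray())
```

CSR is a natural choice here because it stores data row by row, and selecting train and test samples is a row selection.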