I’m trying to replicate the example of StratifiedShuffleSplit with X not being an array

Asked: June 12, 2026

I’m trying to replicate the example of StratifiedShuffleSplit with X not being an array but a sparse matrix. In the example below, this matrix was created by a DictVectorizer fit to an array of mixed nominal and numerical features.

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import StratifiedShuffleSplit

X = [{"a":1, "b":"xx"}, {"a":2, "b":"yx"}, {"a":2, "b":"yx"}, {"a":1, "b":"xx"}]
y = ["A", "B", "B", "A"]

X = DictVectorizer().fit_transform(X)
y = LabelEncoder().fit_transform(y)

sss = StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0)

for train_index, test_index in sss:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

When I run the script, the following error is thrown:

Traceback (most recent call last):
  File ".../test.py", line 22, in <module>
    X_train, X_test = X[train_index], X[test_index]
TypeError: only integer arrays with one element can be converted to an index

This is because X is not an array but a sparse matrix. So the question is: how can I split the data using this method when X is a sparse matrix rather than an array? Perhaps the problem lies not in scikit-learn but in numpy? Do I have to “transform” train_index and test_index before “applying” them to X? Or maybe I have to “transform” X instead?

According to the documentation of StratifiedShuffleSplit, for it to work with matrices, I should pass True to the parameter indices, but it doesn’t help.

Any suggestion you could give me would be more than welcome.

1 Answer

Editorial Team · Answer 1 · Added on June 12, 2026 at 7:44 am

The issue is that in your version of scikit-learn, DictVectorizer returns a COO matrix, which does not support row-wise indexing (unfortunately, the scipy error message is not very explicit). To fix it, convert the vectorized output to CSR format by replacing the line:

X = DictVectorizer().fit_transform(X)

with

X = DictVectorizer().fit_transform(X).tocsr()
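
For readers on a recent scikit-learn, here is a minimal sketch of the same split, under a few assumptions: `sklearn.cross_validation` has since been removed in favour of `sklearn.model_selection`, where `StratifiedShuffleSplit` takes `n_splits` and yields indices from `.split(X, y)`. The names `records`, `labels`, and `splits` are introduced here for illustration only. The `.tocsr()` call is kept as a safeguard; it should be harmless if the vectorizer already returns CSR.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedShuffleSplit

# Same toy data as in the question (variable names are illustrative).
records = [{"a": 1, "b": "xx"}, {"a": 2, "b": "yx"},
           {"a": 2, "b": "yx"}, {"a": 1, "b": "xx"}]
labels = ["A", "B", "B", "A"]

# Convert to CSR so that rows can be selected with an integer index array.
X = DictVectorizer().fit_transform(records).tocsr()
y = LabelEncoder().fit_transform(labels)

# In sklearn.model_selection the splitter is configured first,
# and the data is passed to .split() afterwards.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

splits = []
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Record shapes and labels so the result can be inspected afterwards.
    splits.append((X_train.shape, X_test.shape,
                   sorted(y_train.tolist()), sorted(y_test.tolist())))
```

Because the split is stratified and each class has two samples, every train and test half should contain one sample of each class.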
