I am a beginner with scikit-learn and SVMs and I would like to check a couple of things. I have 700 samples with 35 features each, belonging to 3 classes. My samples and features are in an array X, which I scaled using preprocessing.scale(X).
The first step is to find the suitable SVM parameters and I am using the grid search with nested cross validation (see http://scikit-learn.org/stable/auto_examples/grid_search_digits.html#).
I am using all my samples (X) in the “grid search”. During the grid search, the data is split into training and testing (using StratifiedKFold).
When I get my SVM parameters, I perform the classification where I divide my data into training and testing.
Is it ok to use the same data in the grid search that I will be using during the real classification?
It is OK to use this data for training (fitting) a classifier. Cross validation, as done by StratifiedKFold, is intended for situations where you don’t have enough data to hold out a validation set while optimizing the hyperparameters (the algorithm settings). You can also use it if you’re too lazy to make a validation set splitter and want to rely on scikit-learn’s built-in cross validation 🙂

The refit option to GridSearchCV will retrain the estimator on the full training set after finding the optimal settings with cross validation.

It is, however, senseless to apply a trained classifier to the data you grid searched or trained on, since you already have the labels. If you want to do formal evaluation of a classifier, you should hold out a test set from the very beginning and not touch that again until you’ve done all your grid searching, validation and fitting.
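The workflow described above could look roughly like the following. This is only a minimal sketch under some assumptions: the data is synthetic (generated with make_classification to match the 700 x 35, 3-class shape from the question), the 25% test split and the C/gamma grid are arbitrary illustrative choices, and the imports use the sklearn.model_selection module found in recent scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import scale
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the asker's data (700 samples, 35 features, 3 classes).
X, y = make_classification(n_samples=700, n_features=35, n_informative=10,
                           n_classes=3, random_state=0)
X = scale(X)  # mirrors the preprocessing.scale(X) call from the question

# Hold out a test set from the very beginning; it is not used during the grid search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Assumption: an illustrative grid, not a recommendation for any particular data set.
param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

# Cross-validated grid search on the training portion only.
# refit=True retrains the best setting on all of X_train once the search is done.
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=StratifiedKFold(n_splits=5), refit=True)
search.fit(X_train, y_train)

# Formal evaluation: score the refitted estimator once on the untouched test set.
test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("held-out test accuracy:", test_accuracy)
```

The point of the split is that search.score(X_test, y_test) is computed on samples that played no part in choosing C and gamma or in fitting the final model, so it is a fair estimate of how the classifier behaves on unseen data.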