I want to train a MultiLayerPerceptron using Weka with ~200 samples and 6 attributes.
I was thinking of splitting the data into train and test sets and, within the train set, setting aside a certain % as a validation set.
But then I considered using k-fold cross-validation in order to make better use of my samples.
My question is: does it make sense to specify a validation set when using a cross-validation approach?
And, considering the sample size, can you suggest some numbers for the two approaches? (e.g. 2/3 for train, 1/3 test, and 20% validation… and for CV: 10-fold, 2-fold, or LOOCV instead…)
Thank you in advance!
Your question sounds like you’re not entirely familiar with cross-validation. As you noticed, there is a parameter for the number of folds to run. For a simple cross-validation, this parameter defines the number of subsets that are created out of your original set. Let that parameter be k. Your original set is split into k (roughly) equally sized subsets. In each run, training is done on k-1 subsets and validation on the one remaining subset. The next run holds out a different subset and trains on the other k-1, and so on, so after k iterations every subset has been used for validation exactly once.
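To make the fold mechanics concrete, here is a minimal sketch in plain Python (not Weka's own implementation). The function name `k_fold_indices` and the `seed` parameter are just illustrative assumptions; it only produces the index splits and leaves the actual training and scoring to you.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, held_out_idx) pairs, one pair per fold.

    Hypothetical helper for illustration only; Weka handles this internally.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        # Train on the other k-1 folds.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

# With 200 samples and k=10, each run trains on 180 and validates on 20.
splits = list(k_fold_indices(n_samples=200, k=10))
for train, held_out in splits:
    print(len(train), len(held_out))
```

Each sample ends up in the held-out subset exactly once, which is why cross-validation makes better use of a small data set than a single split.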
For your data set size, k=10 sounds reasonable, but basically any setting is worth testing, as long as you take all the results into account rather than just reporting the best one.
For the simplest evaluation, you just use 2/3 as the training set, and the 1/3 “test set” is actually your validation set. There are more sophisticated approaches, though, which use the test set as a termination criterion for training and a separate validation set for the final evaluation (since your results might be overfitted to the test set as well, because it determines when training stops). For this approach you obviously need to split the set differently (e.g. 2/3 training, 3/12 test and 1/12 validation).
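As a rough illustration of that three-way split, here is a small sketch in plain Python. The function name `three_way_split` and its parameters are assumptions made for this example, and the set names follow the usage above (the "test" part decides when to stop training, the "validation" part is kept for the final evaluation).

```python
import random

def three_way_split(n_samples, train_frac=2/3, stop_frac=3/12, seed=0):
    """Split shuffled indices into (train, test, validation) lists.

    Hypothetical helper: 'test' is the set used as the termination criterion,
    'validation' is whatever remains and is used only for the final evaluation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_frac)
    n_stop = round(n_samples * stop_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_stop]
    validation = indices[n_train + n_stop:]  # the remainder, about 1/12
    return train, test, validation

train, test, validation = three_way_split(200)
print(len(train), len(test), len(validation))  # 133 50 17
```

With only about 17 samples left for the final evaluation, that estimate will be noisy, which is one more reason cross-validation is attractive at this sample size.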