Is there a rule-of-thumb for how to best divide data into training and validation

Question

0

Asked: June 15, 20262026-06-15T12:53:41+00:00 2026-06-15T12:53:41+00:00

Is there a rule-of-thumb for how to best divide data into training and validation

0

Is there a rule-of-thumb for how to best divide data into training and validation sets? Is an even 50/50 split advisable? Or are there clear advantages of having more training data relative to validation data (or vice versa)? Or is this choice pretty much application dependent?

I have been mostly using an 80% / 20% of training and validation data, respectively, but I chose this division without any principled reason. Can someone who is more experienced in machine learning advise me?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-15T12:53:42+00:00

There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage.

If you have a total of 100 instances, you’re probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn’t really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).

Assuming you have enough data to do proper held-out test data (rather than cross-validation), the following is an instructive way to get a handle on variances:

Split your data into training and testing (80/20 is indeed a good starting point)
Split the training data into training and validation (again, 80/20 is a fair split).
Subsample random selections of your training data, train the classifier with this, and record the performance on the validation set
Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60%, 80%. You should see both greater performance with more data, but also lower variance across the different random samples
To get a handle on variance due to the size of test data, perform the same procedure in reverse. Train on all of your training data, then randomly sample a percentage of your validation data a number of times, and observe performance. You should now find that the mean performance on small samples of your validation data is roughly the same as the performance on all the validation data, but the variance is much higher with smaller numbers of test samples

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Is there a rule-of-thumb for how to best divide data into training and validation

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply