I am trying to predict test results based on known previous scores. The test is made up of three subjects, each contributing to the final exam score. For all students I have their previous scores for mini-tests in each of the three subjects, and I know which teacher they had. For half of the students (the training set) I have their final score; for the other half (the test set) I don’t. I want to predict their final score.
So the training set looks like this:
student teacher subject1score subject2score subject3score finalscore
while the test set is the same but without the final score
student teacher subject1score subject2score subject3score
So I want to predict the final score of the test set students. Any ideas for a simple learning algorithm or statistical technique to use?
The simplest and most reasonable method to try is a linear regression, with the teacher and the three scores used as predictors. (This is based on the assumption that the teacher and the three test scores will each have some predictive ability towards the final exam, but that they could contribute differently; for example, the third test might matter the most.)
You don’t mention a specific language, but let’s say you loaded it into R as two data frames called `training.scores` and `test.scores`. Fitting the model is as simple as calling `lm` on the training data, and the predictions then come from calling `predict` on the fitted model with the test data.
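A minimal sketch of both steps, assuming the column names from the question. The data here is made up purely so the example runs on its own (the values and the relationship between the scores are hypothetical); with real data you would skip the simulation and use your own `training.scores` and `test.scores`.

```r
# Hypothetical toy data standing in for the real scores (all values made up).
set.seed(1)
n <- 40
scores <- data.frame(
  student = seq_len(n),
  teacher = factor(rep(c("A", "B", "C"), length.out = n)),
  subject1score = round(runif(n, 40, 100)),
  subject2score = round(runif(n, 40, 100)),
  subject3score = round(runif(n, 40, 100))
)
# Assumed relationship, only so the model has something to find.
scores$finalscore <- with(scores,
  0.2 * subject1score + 0.3 * subject2score + 0.5 * subject3score +
    rnorm(n, sd = 3))

# First half has the final score, second half does not.
training.scores <- scores[1:20, ]
test.scores <- scores[21:40, names(scores) != "finalscore"]

# Fit: teacher enters as a categorical predictor, the three scores as numeric ones.
lm.fit <- lm(finalscore ~ teacher + subject1score + subject2score + subject3score,
             data = training.scores)

# Predict the unknown final scores for the test-set students.
predictions <- predict(lm.fit, newdata = test.scores)
head(predictions)
```

Because `teacher` is a factor, `lm` estimates a separate offset for each teacher relative to the first one, so every teacher in the test set also needs to appear in the training set.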
Googling for “R linear regression”, “R linear models”, or similar searches will find many resources that can help. You can also learn about slightly more sophisticated methods such as generalized linear models or generalized additive models, which are almost as easy to perform as the above.
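To illustrate how little the code changes, here is a sketch using `glm` from base R, again on made-up data (the data frame and its values are hypothetical). With the Gaussian family it fits the same model as `lm`; choosing a different family is what makes the model "generalized". Generalized additive models are fitted in a similar style with `gam` from the mgcv package, which is not shown here.

```r
# Hypothetical toy training data (all values made up).
set.seed(7)
n <- 30
training.scores <- data.frame(
  teacher = factor(rep(c("A", "B", "C"), length.out = n)),
  subject1score = runif(n, 40, 100),
  subject2score = runif(n, 40, 100),
  subject3score = runif(n, 40, 100)
)
training.scores$finalscore <- with(training.scores,
  0.2 * subject1score + 0.3 * subject2score + 0.5 * subject3score +
    rnorm(n, sd = 3))

# Same formula as the linear model; only the function and family argument differ.
glm.fit <- glm(finalscore ~ teacher + subject1score + subject2score + subject3score,
               family = gaussian(), data = training.scores)

# Ordinary linear model for comparison.
lm.fit <- lm(finalscore ~ teacher + subject1score + subject2score + subject3score,
             data = training.scores)

# Compare the coefficients of the two fits.
all.equal(coef(glm.fit), coef(lm.fit))
```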
ETA: There have been books written about the topic of interpreting linear regression; an example of a simple guide is here. In general, you’ll call `summary(lm.fit)` to print a bunch of information about the fit. You’ll see a table of coefficients in the output, with one row per predictor and columns including `Estimate` and `Pr(>|t|)`. The Estimate will give you an idea of how strong the effect of that variable was, while the p-values (`Pr(>|t|)`) give you an idea of whether each variable’s apparent effect is real or could be due to random noise. There’s a lot more to it, but I invite you to read the excellent resources available online.
Also, `plot(lm.fit)` will draw graphs of the residuals (a residual is the amount a fitted value is off by for a student in your training set), which you can use to judge whether the assumptions of the model are fair.
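A short sketch of pulling out the coefficient table and the residuals, assuming a model fitted as described above. The training data here is simulated (hypothetical values) only so the example is self-contained.

```r
# Hypothetical toy training data (all values made up).
set.seed(42)
n <- 30
training.scores <- data.frame(
  teacher = factor(rep(c("A", "B", "C"), length.out = n)),
  subject1score = runif(n, 40, 100),
  subject2score = runif(n, 40, 100),
  subject3score = runif(n, 40, 100)
)
training.scores$finalscore <- with(training.scores,
  0.2 * subject1score + 0.3 * subject2score + 0.5 * subject3score +
    rnorm(n, sd = 3))

lm.fit <- lm(finalscore ~ teacher + subject1score + subject2score + subject3score,
             data = training.scores)

# Coefficient table as a matrix: one row per term, with columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coef.table <- coef(summary(lm.fit))
print(coef.table)

# Residuals: observed final score minus fitted value, on the training data.
res <- residuals(lm.fit)

# Draw the standard diagnostic plots in a 2x2 grid, then restore the layout.
old.par <- par(mfrow = c(2, 2))
plot(lm.fit)
par(old.par)
```

Working with `coef(summary(lm.fit))` rather than the printed summary is convenient when you want to sort or filter predictors by their estimates or p-values.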