I have a large (10000 X 5001) table representing 10000 samples and 5001 different features of these samples. One of these features represents an output variable of each sample. In other words, I have 5000 input variables and one output variable for each sample.
I know that most of these inputs are irrelevant. Therefore, what I would like to do is determine the subset of input variables that predicts the output variable best. What is the best/simplest way to go about doing this in R?
You might want to check out Weka. In the
Explorerload the data and then go to theSelect attributestab. There you will find several options to get the most informative attributes/features in your dataset.