Sorry if this question is somewhat specific to the Python scikit-learn library.
I am trying to perform a grid search to find optimal parameters for scikit-learn’s GradientBoostingRegressor. The problem is, I don’t know where to start. In the past I have used R with RStudio, but I am currently trying to migrate to Python for data mining, and scikit-learn seems very promising.
Can anyone share some simple setup code they have used to run this on an Amazon EC2 cluster, or point to a useful example or reference for that library, even if it is for another machine learning algorithm?
Thank you.
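As far as I know, GBRT is a pretty sequential algorithm (each new tree is fit to correct the errors of the trees built before it), hence there is no trivial way to run it in parallel.
Random forests / ExtraTrees models are embarrassingly parallel, since their trees are built independently of each other, hence they would be better candidates for training models on a cluster.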
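To illustrate the difference, here is a minimal sketch of training a random forest on several cores of a single machine. The synthetic dataset from make_regression and the specific parameter values are only assumptions for the example, not a recommendation:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used here only as a stand-in for real data.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# n_jobs=-1 asks joblib to use all available cores; the trees are
# independent of each other, so they can be built in parallel.
forest = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_), "trees fitted")
```

GradientBoostingRegressor has no n_jobs argument for the same reason given above: its trees cannot be built independently.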
scikit-learn has some built-in support for single-machine multiprocessing using joblib (check the docstring of models that accept an n_jobs argument). We plan to implement a task dispatch framework in joblib at some point. Thus we could, for instance, leverage IPython parallel as a backend to run on a cluster. However, there is nothing ready out of the box for this currently.
If you are ready to invest some time doing it yourself, I would advise you to have a look at StarCluster and its IPython plugin:
http://star.mit.edu/cluster/
http://star.mit.edu/cluster/docs/latest/plugins/ipython.html
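For the single-machine case, even though fitting one GBRT model is sequential, the different parameter combinations of a grid search are independent and can be evaluated in parallel through n_jobs. A minimal sketch follows, assuming a recent scikit-learn release where GridSearchCV lives in sklearn.model_selection; the synthetic data and the grid values are placeholders chosen only for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used here only as a stand-in for real data.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Illustrative grid: 2 x 2 x 2 = 8 parameter combinations.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# Each (parameter combination, CV fold) pair is an independent fit,
# so joblib can spread them over n_jobs worker processes.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    n_jobs=2,
)
search.fit(X, y)

print(search.best_params_)
```

This only uses the cores of one machine (for example a single large EC2 instance); it does not distribute work across a cluster.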