I would like to import data from a CSV file to use in scikit-learn. It has a mix of numerical and categorical data, e.g.
someValue,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
I need to convert this representation into a purely numerical one where categorical data points get converted into multiple binary columns, e.g.
someValue,colorIsRed,colorIsBlue,someOtherValue
1.2,1,0,55.6
1.9,0,1,20.5
3.2,1,0,16.5
Is there any utility that does this for me, or an easy way to iterate through the data and get this representation?
scikit-learn doesn’t offer data-loading functions as far as I know, but it does prefer NumPy arrays as input. NumPy’s loadtxt function, together with its
converters parameter, can be used to load your CSV and specify the type of each column. It does not binarize your second column, though.
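For the binarization step itself, here is a minimal sketch that uses only the standard library's csv module. It is one possible approach, not a scikit-learn or NumPy feature: the helper name one_hot_rows and the "colorIsRed" naming scheme are my own, chosen to match the output in the question, and it assumes every non-categorical column parses as a float.

```python
import csv
import io

# Sample data copied from the question; a real script would open the CSV file instead.
CSV_TEXT = """someValue,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
"""


def one_hot_rows(lines, categorical):
    """Hypothetical helper: expand each categorical column into 0/1 columns.

    `lines` is any iterable of CSV lines (an open file works);
    `categorical` is the set of column names to binarize.
    """
    reader = csv.DictReader(lines)
    records = list(reader)
    columns = reader.fieldnames or []

    # Collect the distinct values of each categorical column,
    # in order of first appearance.
    categories = {col: [] for col in categorical}
    for rec in records:
        for col in categorical:
            if rec[col] not in categories[col]:
                categories[col].append(rec[col])

    # Build the new header, e.g. color -> colorIsRed, colorIsBlue.
    header = []
    for col in columns:
        if col in categorical:
            header.extend(f"{col}Is{value.capitalize()}" for value in categories[col])
        else:
            header.append(col)

    # Build the purely numerical rows.
    rows = []
    for rec in records:
        row = []
        for col in columns:
            if col in categorical:
                row.extend(1.0 if rec[col] == value else 0.0 for value in categories[col])
            else:
                row.append(float(rec[col]))  # assumes the column is numeric
        rows.append(row)
    return header, rows


header, rows = one_hot_rows(io.StringIO(CSV_TEXT), categorical={"color"})
print(header)
for row in rows:
    print(row)
```

The resulting list of lists can be passed to numpy.array to get the array scikit-learn expects. If you would rather not write this by hand, scikit-learn's DictVectorizer (in sklearn.feature_extraction) performs a similar expansion on a list of dicts, though its generated column names differ from the ones above.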