I'm using the excellent read_csv() function from pandas, which gives:
In [31]: data = pandas.read_csv("lala.csv", delimiter=",")
In [32]: data
Out[32]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12083 entries, 0 to 12082
Columns: 569 entries, REGIONC to SCALEKER
dtypes: float64(51), int64(518)
But when I apply a function from scikit-learn, I lose the information about the columns:
from sklearn import preprocessing
preprocessing.scale(data)
This returns a NumPy array.
Is there a way to apply scikit-learn or NumPy functions to DataFrames without losing this information?
A (slightly naive) way would be to store the structure of your data frame, i.e. its columns and index, separately, and then create a new data frame from your preprocessed results like so:
As you can see in Out[22], we start off with a data frame, and then in In[29] we place some new data inside the frame, leaving the rows and columns unchanged. I am assuming your preprocessing will not shuffle the rows/columns of the data.
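For illustration, here is a minimal sketch of that idea. It is not the original session: the small frame below is a made-up stand-in for the contents of "lala.csv", and the names scaled_values and scaled are just chosen for this example. It assumes the preprocessing step keeps the row and column order intact.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for the frame loaded from "lala.csv"
data = pd.DataFrame(
    {"REGIONC": [1, 2, 3, 4], "SCALEKER": [0.5, 1.5, 2.5, 3.5]},
    index=[10, 11, 12, 13],
)

# preprocessing.scale returns a plain NumPy array, so the labels are dropped
scaled_values = preprocessing.scale(data)

# Rebuild a DataFrame by reusing the original index and column labels
scaled = pd.DataFrame(scaled_values, index=data.index, columns=data.columns)
```

Because the labels are simply reattached by position, this only gives correct results when the function returns an array of the same shape, with rows and columns in the same order as the input.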