Python with the pandas module seems like a great alternative to MATLAB and R, which is why I've very recently switched to it. There are resources out there, and I've searched the forum, but I couldn't find anything similar to my problem. If you have links to tutorials or other useful material, please post them.
Wes McKinney has a great and elaborate tutorial on pandas.
http://www.youtube.com/watch?v=w26x-z-BdWQ&list=FLJ5xKwlfj7wg8S_A5SgR6Wg&feature=mh_lolz
At 1:10 he shows an example of how to index the rows in a dataframe by dates rather than integers.
I would like to do something similar.
The difference is that I have 3 variables, Y1, Y2, Y3, each with a column of timestamps, X1, X2, X3.
TestFile.txt:
X1 Y1 X2 Y2 X3 Y3
27/11/2012 11.436 29/11/2012 20.631 4/12/2012 10.209
28/11/2012 11.468 30/11/2012 20.185 5/12/2012 9.973
29/11/2012 11.414 3/12/2012 19.962 6/12/2012 9.736
30/11/2012 11.355 4/12/2012 19.562 7/12/2012 9.509
3/12/2012 11.309 5/12/2012 18.908 10/12/2012 9.259
4/12/2012 11.118 6/12/2012 18.288 11/12/2012 8.109
5/12/2012 10.873 7/12/2012 17.973
6/12/2012 10.582 10/12/2012 17.788
7/12/2012 10.264 11/12/2012 17.554
10/12/2012 9.886
11/12/2012 9.164
I want to do four things:
- Associate the data in Yi with its date in Xi, for i = 1, 2, 3
- Index rows by dates
- Remove all data that is older than 4/12/2012, which is the first date of Y3
- Be able to access all data by date and column only
Here is a test script that shows how the data is read and how it prints.
You can see that X1 is correctly parsed to the pandas date format, but X2 and X3 are not, even though that is what I attempted to do by specifying index_col=[0,2,4] and parse_dates = True.
TestFile.py:
import pandas as pd
df = pd.read_csv('TestFile.txt',sep='\t', index_col=[0,2,4], parse_dates = True)
print 'pandas version: ', pd.__version__
print df
Gives output:
pandas version: 0.10.0b1
X1 X2 X3 Y1 Y2 Y3
2012-11-27 29/11/2012 4/12/2012 11.436 20.631 10.209
2012-11-28 30/11/2012 5/12/2012 11.468 20.185 9.973
2012-11-29 3/12/2012 6/12/2012 11.414 19.962 9.736
2012-11-30 4/12/2012 7/12/2012 11.355 19.562 9.509
2012-03-12 5/12/2012 10/12/2012 11.309 18.908 9.259
2012-04-12 6/12/2012 11/12/2012 11.118 18.288 8.109
2012-05-12 7/12/2012 None 10.873 17.973 NaN
2012-06-12 10/12/2012 None 10.582 17.788 NaN
2012-07-12 11/12/2012 None 10.264 17.554 NaN
2012-10-12 None None 9.886 NaN NaN
2012-11-12 None None 9.164 NaN NaN
Wanted output:
Y1 Y2 Y3
2012-04-12 11.118 19.562 10.209
2012-05-12 10.873 18.908 9.973
2012-06-12 10.582 18.288 9.736
2012-07-12 10.264 17.973 9.509
2012-10-12 9.886 17.788 9.259
2012-11-12 9.164 17.554 8.109
If you have any idea of how to do this, your help is much appreciated:)
I think your confusion is due to a misunderstanding about the index_col argument. When you pass a list of columns to index_col, pandas attempts to create a multi-index, that is, a dataframe with more than one column as index, like a multi-dimensional table. It is NOT trying to create a single index by concatenating multiple columns.
One strategy that would work is to create three dataframes with the appropriate pairs of columns from your input file, and then concatenate them.
X1 Y1 X2 Y2 X3 Y3 –> Dataframe of (X1, Y1) + Dataframe of (X2, Y2) + Dataframe of (X3, Y3)
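The split-and-concatenate strategy could be sketched roughly as below. This is only an illustration written for a current pandas on Python 3, not the exact code for 0.10. It reads an in-memory copy of TestFile.txt through io.StringIO so that it runs on its own, and it assumes the dates are day-first (27/11/2012 cannot be anything else), which is why the format is given explicitly. Note that the output in the question shows 3/12/2012 read as 2012-03-12, i.e. month-first. The names raw, pieces and combined are made up for the example.

```python
import io
import pandas as pd

# In-memory copy of TestFile.txt (tab-separated); short rows are padded with empty fields.
raw = "\n".join([
    "X1\tY1\tX2\tY2\tX3\tY3",
    "27/11/2012\t11.436\t29/11/2012\t20.631\t4/12/2012\t10.209",
    "28/11/2012\t11.468\t30/11/2012\t20.185\t5/12/2012\t9.973",
    "29/11/2012\t11.414\t3/12/2012\t19.962\t6/12/2012\t9.736",
    "30/11/2012\t11.355\t4/12/2012\t19.562\t7/12/2012\t9.509",
    "3/12/2012\t11.309\t5/12/2012\t18.908\t10/12/2012\t9.259",
    "4/12/2012\t11.118\t6/12/2012\t18.288\t11/12/2012\t8.109",
    "5/12/2012\t10.873\t7/12/2012\t17.973\t\t",
    "6/12/2012\t10.582\t10/12/2012\t17.788\t\t",
    "7/12/2012\t10.264\t11/12/2012\t17.554\t\t",
    "10/12/2012\t9.886\t\t\t\t",
    "11/12/2012\t9.164\t\t\t\t",
])

# Read everything as plain columns first; no index_col, no automatic date parsing.
df = pd.read_csv(io.StringIO(raw), sep="\t")

pieces = []
for i in (1, 2, 3):
    # Take one (Xi, Yi) pair and drop the padding rows where that pair is empty.
    pair = df[[f"X{i}", f"Y{i}"]].dropna()
    # Assumption: dates are day/month/year.
    dates = pd.to_datetime(pair[f"X{i}"], format="%d/%m/%Y")
    # Build a series of Yi values indexed by its own dates.
    pieces.append(
        pd.Series(pair[f"Y{i}"].to_numpy(),
                  index=pd.DatetimeIndex(dates.to_numpy()),
                  name=f"Y{i}")
    )

# Align the three series on the union of their dates, one column per variable.
combined = pd.concat(pieces, axis=1).sort_index()
print(combined)
```

Dates that exist for one variable but not another show up as NaN in the combined frame, for example Y3 before 4/12/2012.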
If you are using the latest development version of pandas, or are willing to, this is simplified by using the new parse_cols argument in read_csv(). Or you can read in all the data, extract the three dataframes you need, and then concatenate them.
Finally, you can use df.truncate with the before and after arguments to get the date range you need. More simply, you could use dropna() to omit dates with missing values.
Hope this helps. Do let us know what version of pandas you are using.
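For the trimming step, here is a small sketch of truncate and dropna on a date-indexed frame. The frame is typed in by hand from a few rows of the sample data, already aligned by date, so the example runs on its own. The names combined, trimmed and complete are just for illustration, and it assumes a current pandas.

```python
import numpy as np
import pandas as pd

# A few rows of the sample data, already aligned by date (NaN where Y3 has no value yet).
dates = pd.to_datetime(["2012-11-30", "2012-12-03", "2012-12-04", "2012-12-05"])
combined = pd.DataFrame(
    {
        "Y1": [11.355, 11.309, 11.118, 10.873],
        "Y2": [20.185, 19.962, 19.562, 18.908],
        "Y3": [np.nan, np.nan, 10.209, 9.973],
    },
    index=dates,
)

# Option 1: cut off everything before a given date (the index must be sorted).
trimmed = combined.truncate(before=pd.Timestamp("2012-12-04"))

# Option 2: keep only the dates where every column has a value.
complete = combined.dropna()

# Access a single value by date and column.
value = trimmed.loc[pd.Timestamp("2012-12-05"), "Y3"]
print(trimmed)
print(value)
```

With this data both options keep the same two rows. They would differ if a value were missing somewhere in the middle of the range, since dropna removes that row and truncate does not.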