So Python, with the pandas module seems like a great option to matlab and

Question

Asked: June 16, 2026 (2026-06-16T02:06:29+00:00)

Python with the pandas module seems like a great alternative to MATLAB and R, which is why I’ve very recently switched to it. There are resources out there, and I’ve searched the forum, but I couldn’t find anything similar to my problem. If you have links to tutorials or other useful material, please post them.

Wes McKinney has a great and elaborate tutorial on pandas.
http://www.youtube.com/watch?v=w26x-z-BdWQ&list=FLJ5xKwlfj7wg8S_A5SgR6Wg&feature=mh_lolz

At 1:10 he shows an example of how to index the rows in a dataframe by dates rather than integers.
I would like to do something similar.

The difference is that I have 3 variables, Y1, Y2, Y3, each with a column of timestamps, X1, X2, X3.

TestFile.txt:  
X1  Y1  X2  Y2  X3  Y3
27/11/2012  11.436  29/11/2012  20.631  4/12/2012   10.209  
28/11/2012  11.468  30/11/2012  20.185  5/12/2012   9.973  
29/11/2012  11.414  3/12/2012   19.962  6/12/2012   9.736  
30/11/2012  11.355  4/12/2012   19.562  7/12/2012   9.509  
3/12/2012   11.309  5/12/2012   18.908  10/12/2012  9.259  
4/12/2012   11.118  6/12/2012   18.288  11/12/2012  8.109  
5/12/2012   10.873  7/12/2012   17.973  
6/12/2012   10.582  10/12/2012  17.788  
7/12/2012   10.264  11/12/2012  17.554  
10/12/2012  9.886  
11/12/2012  9.164

I want to do four things:

1. Associate the data in Yi with its date in Xi, for i = 1, 2, 3
2. Index rows by dates
3. Remove all data that is older than 4/12/2012, which is the first date of Y3
4. Be able to access all data by date and column only

Here is a test script that shows how the data is read and how it prints.
You can see that X1 is parsed into the pandas date format, but X2 and X3 are not, even though that is what I attempted by specifying index_col=[0,2,4] and parse_dates = True.

TestFile.py:
import pandas as pd

# Read the tab-separated file, asking for columns 0, 2 and 4 (X1, X2, X3) as the index
df = pd.read_csv('TestFile.txt', sep='\t', index_col=[0, 2, 4], parse_dates=True)

print('pandas version: ', pd.__version__)
print(df)

Gives output:

pandas version:  0.10.0b1
X1         X2         X3              Y1      Y2      Y3                   
2012-11-27 29/11/2012 4/12/2012   11.436  20.631  10.209
2012-11-28 30/11/2012 5/12/2012   11.468  20.185   9.973
2012-11-29 3/12/2012  6/12/2012   11.414  19.962   9.736
2012-11-30 4/12/2012  7/12/2012   11.355  19.562   9.509
2012-03-12 5/12/2012  10/12/2012  11.309  18.908   9.259
2012-04-12 6/12/2012  11/12/2012  11.118  18.288   8.109
2012-05-12 7/12/2012  None        10.873  17.973     NaN
2012-06-12 10/12/2012 None        10.582  17.788     NaN
2012-07-12 11/12/2012 None        10.264  17.554     NaN
2012-10-12 None       None         9.886     NaN     NaN
2012-11-12 None       None         9.164     NaN     NaN

Wanted output:

                Y1      Y2       Y3                 
2012-04-12  11.118  19.562   10.209
2012-05-12  10.873  18.908    9.973
2012-06-12  10.582  18.288    9.736
2012-07-12  10.264  17.973    9.509
2012-10-12   9.886  17.788    9.259
2012-11-12   9.164  17.554    8.109

If you have any idea of how to do this, your help is much appreciated:)

1 Answer

Editorial Team · Answer 1 · 2026-06-16T02:06:30+00:00

I think your confusion is due to a misunderstanding about the index_col argument. When you pass a list of columns to index_col, pandas builds a multi-index, that is, a dataframe indexed by several columns at once, like a multi-dimensional table. It does NOT build a single index by merging the listed columns.
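To make this concrete, here is a minimal sketch on a three-row excerpt of your data. The excerpt is inlined as a string rather than read from TestFile.txt, the name RAW is only illustrative, and a recent pandas on Python 3 is assumed.

```python
import io
import pandas as pd

# Three-row, tab-separated excerpt of TestFile.txt, inlined so the sketch is self-contained.
RAW = (
    "X1\tY1\tX2\tY2\tX3\tY3\n"
    "27/11/2012\t11.436\t29/11/2012\t20.631\t4/12/2012\t10.209\n"
    "28/11/2012\t11.468\t30/11/2012\t20.185\t5/12/2012\t9.973\n"
    "29/11/2012\t11.414\t3/12/2012\t19.962\t6/12/2012\t9.736\n"
)

# A list passed to index_col asks for one index level per listed column.
df = pd.read_csv(io.StringIO(RAW), sep="\t", index_col=[0, 2, 4])

print(type(df.index).__name__)  # class of the resulting index
print(list(df.index.names))     # one level per index column
print(list(df.columns))         # only the Y columns remain as data
```

The index comes back as a MultiIndex with the levels X1, X2 and X3, so each row is keyed by a triple of dates rather than by one shared date.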

One strategy that would work is to create three dataframes with the appropriate pairs of columns from your input file, and then concatenate them.

X1 Y1 X2 Y2 X3 Y3 -> Dataframe of (X1, Y1) + Dataframe of (X2, Y2) + Dataframe of (X3, Y3)
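One way this strategy could look is sketched below. It is not the only way to do it. The data is inlined as a string (RAW is an illustrative name), the dates are assumed to be day/month/year, and a recent pandas on Python 3 is assumed.

```python
import io
import pandas as pd

# Contents of TestFile.txt, tab-separated; short rows simply lack the trailing fields.
RAW = (
    "X1\tY1\tX2\tY2\tX3\tY3\n"
    "27/11/2012\t11.436\t29/11/2012\t20.631\t4/12/2012\t10.209\n"
    "28/11/2012\t11.468\t30/11/2012\t20.185\t5/12/2012\t9.973\n"
    "29/11/2012\t11.414\t3/12/2012\t19.962\t6/12/2012\t9.736\n"
    "30/11/2012\t11.355\t4/12/2012\t19.562\t7/12/2012\t9.509\n"
    "3/12/2012\t11.309\t5/12/2012\t18.908\t10/12/2012\t9.259\n"
    "4/12/2012\t11.118\t6/12/2012\t18.288\t11/12/2012\t8.109\n"
    "5/12/2012\t10.873\t7/12/2012\t17.973\n"
    "6/12/2012\t10.582\t10/12/2012\t17.788\n"
    "7/12/2012\t10.264\t11/12/2012\t17.554\n"
    "10/12/2012\t9.886\n"
    "11/12/2012\t9.164\n"
)

df = pd.read_csv(io.StringIO(RAW), sep="\t")

pieces = []
for i in (1, 2, 3):
    # Keep one (Xi, Yi) pair and drop the rows where that pair has no data.
    pair = df[[f"X{i}", f"Y{i}"]].dropna()
    # Assumption: dates are day/month/year, so the format is stated explicitly.
    dates = pd.to_datetime(pair[f"X{i}"], format="%d/%m/%Y")
    index = pd.DatetimeIndex(dates, name="date")
    pieces.append(pd.Series(pair[f"Y{i}"].to_numpy(), index=index, name=f"Y{i}"))

# Align the three series on the union of their dates; missing values become NaN.
combined = pd.concat(pieces, axis=1).sort_index()

print(combined)
print(combined.loc[pd.Timestamp("2012-12-04"), "Y2"])  # access by date and column
```

Each Yi ends up keyed by its own Xi, and the last line shows the access by date and column that the question asks for.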

If your version of pandas provides it, this is simplified by the column-selection argument of read_csv() (usecols in current releases), which loads only the columns you name. Otherwise you can read in all the data, extract the three dataframes you need, and then concatenate them.
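A rough sketch of the column-selection variant follows, assuming a recent pandas where read_csv() accepts usecols. The helper read_pair and the name RAW are hypothetical, the file is inlined as a string, and the dates are again assumed to be day/month/year.

```python
import io
import pandas as pd

# Contents of TestFile.txt, tab-separated; short rows simply lack the trailing fields.
RAW = (
    "X1\tY1\tX2\tY2\tX3\tY3\n"
    "27/11/2012\t11.436\t29/11/2012\t20.631\t4/12/2012\t10.209\n"
    "28/11/2012\t11.468\t30/11/2012\t20.185\t5/12/2012\t9.973\n"
    "29/11/2012\t11.414\t3/12/2012\t19.962\t6/12/2012\t9.736\n"
    "30/11/2012\t11.355\t4/12/2012\t19.562\t7/12/2012\t9.509\n"
    "3/12/2012\t11.309\t5/12/2012\t18.908\t10/12/2012\t9.259\n"
    "4/12/2012\t11.118\t6/12/2012\t18.288\t11/12/2012\t8.109\n"
    "5/12/2012\t10.873\t7/12/2012\t17.973\n"
    "6/12/2012\t10.582\t10/12/2012\t17.788\n"
    "7/12/2012\t10.264\t11/12/2012\t17.554\n"
    "10/12/2012\t9.886\n"
    "11/12/2012\t9.164\n"
)


def read_pair(i):
    """Hypothetical helper: read only Xi and Yi, return Yi indexed by its dates."""
    pair = pd.read_csv(io.StringIO(RAW), sep="\t", usecols=[f"X{i}", f"Y{i}"]).dropna()
    # Assumption: dates are day/month/year.
    dates = pd.to_datetime(pair[f"X{i}"], format="%d/%m/%Y")
    index = pd.DatetimeIndex(dates, name="date")
    return pd.Series(pair[f"Y{i}"].to_numpy(), index=index, name=f"Y{i}")


# One read per pair, then align on the union of the dates.
combined = pd.concat([read_pair(i) for i in (1, 2, 3)], axis=1).sort_index()
print(combined)
```

This reads the file three times, which is fine for a small file; for a large one, reading once and slicing the columns afterwards is probably the better trade-off.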

Finally, you can call df.truncate() with its before and after arguments to keep only the date range you need. More simply, you could use dropna() to drop the dates that have missing values.
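Both options can be sketched on a small hand-built frame. The four rows below are taken from your data and are already aligned by date; the name combined is illustrative, and a recent pandas is assumed.

```python
import pandas as pd

nan = float("nan")

# Four dates from the question's data, already aligned by date; Y3 starts on 4/12/2012.
combined = pd.DataFrame(
    {
        "Y1": [11.355, 11.309, 11.118, 10.873],
        "Y2": [20.185, 19.962, 19.562, 18.908],
        "Y3": [nan, nan, 10.209, 9.973],
    },
    index=pd.to_datetime(["2012-11-30", "2012-12-03", "2012-12-04", "2012-12-05"]),
)

# Option 1: keep rows from the first Y3 date onwards (truncate needs a sorted index).
trimmed = combined.truncate(before=pd.Timestamp("2012-12-04"))

# Option 2: drop every date on which at least one column is missing.
complete = combined.dropna()

print(trimmed)
print(trimmed.equals(complete))
```

The two results coincide here only because the missing values all sit before the first Y3 date. If a series had a gap in the middle, dropna() would remove that date as well, while truncate() would keep it.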

Hope this helps. Do let us know what version of pandas you are using.
