This is my first time trying Pandas. I think I have a reasonable use case, but I am stumbling. I want to load a tab delimited file into a Pandas Dataframe, then group it by Symbol and plot it with the x.axis indexed by the TimeStamp column. Here is a subset of the data:
Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
Note two things about the TimeStamp column;
- it has duplicate values and
- the intervals are irregular.
I thought I could do something like this…
from pandas import *
import pylab as plt
df = read_csv('data.txt',index_col=5)
df.sort(ascending=False)
df.plot()
plt.show()
But the read_csv method raises an exception “Tried columns 1-X as index but found duplicates”. Is there an option that will allow me to specify an index column with duplicate values?
I would also be interested in aligning my irregular timestamp intervals to one second resolution, I would still wish to plot multiple events for a given second, but maybe I could introduce a unique index, then align my prices to it?
I created several issues just now to address some features / conveniences that I think would be nice to have: GH-856, GH-857, GH-858
We’re currently working on a revamp of the time series capabilities and doing alignment to secondly resolution is possible now (though not with duplicates, so would need to write some functions for that). I also want to support duplicate timestamps in a better way. However, this is really panel (3D) data, so one way that you might alter things is the following:
note that this created a MultiIndex. Another way I could have gotten this:
you can get this data in a true panel format like so:
you can then generate some plots from that data.
note here that the timestamps are still as strings– I guess they could be converted to Python datetime.time objects and things might be a bit easier to work with. I don’t have many plans to provide a lot of support for raw times vs. timestamps (date + time) but if enough people need it I suppose I can be convinced 🙂
If you have multiple observations on a second for a single symbol then some of the above methods will not work. But I want to build in better support for that in upcoming releases of pandas, so knowing your use cases will be helpful to me– consider joining the mailing list (pystatsmodels)