I have two sets of temperature data, which have readings at regular (but different) time intervals. I’m trying to get the correlation between these two sets of data.
I’ve been playing with Pandas to try to do this. I’ve created two time series, and am using TimeSeriesA.corr(TimeSeriesB). However, if the times in the two time series do not match up exactly (they’re generally off by seconds), I get NaN as an answer. I could get a decent answer if I could:
a) interpolate/fill missing times in each TimeSeries (I know this is possible in Pandas, I just don’t know how to do it)
b) strip the seconds out of python datetime objects (Set seconds to 00, without changing minutes). I’d lose a degree of accuracy, but not a huge amount
c) use something else in Pandas to get the correlation between two timeSeries
d) use something in python to get the correlation between two lists of floats, each float having a corresponding datetime object, taking into account the time.
Anyone have any suggestions?
You have a number of options using pandas, but you have to make a decision about how it makes sense to align the data given that they don’t occur at the same instants.
Use the values “as of” the times in one of the time series. For example, suppose the timestamps of your two series are off by 30 seconds. The reindex function enables you to align data while filling forward values (getting the “as of” value): reindex one series onto the other’s index with method='pad', then correlate the result with the other series. Note that ‘pad’ is also aliased by ‘ffill’ (but only in the very latest version of pandas on GitHub as of this time!).
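Here is a minimal sketch of that “as of” alignment. The sample readings and the names series_a, series_b and a_as_of_b are made up for illustration, and it assumes a reasonably recent pandas where 'ffill' is accepted by reindex:

```python
import pandas as pd

# Hypothetical data: one reading per minute, series_b offset by 30 seconds.
idx_a = pd.date_range("2012-01-01 00:00:00", periods=6, freq="min")
idx_b = idx_a + pd.Timedelta(seconds=30)

series_a = pd.Series([20.0, 20.5, 21.0, 20.8, 21.5, 22.0], index=idx_a)
series_b = pd.Series([19.8, 20.6, 20.9, 21.0, 21.4, 22.1], index=idx_b)

# The two indexes share no timestamps, so corr has nothing to pair up
# and returns NaN.
unaligned = series_a.corr(series_b)

# Take the most recent value of series_a "as of" each timestamp in series_b.
a_as_of_b = series_a.reindex(series_b.index, method="ffill")

# Both series now share series_b's index, so corr can pair the values.
aligned = a_as_of_b.corr(series_b)
print(unaligned, aligned)
```

Which series you reindex onto which is your call; forward filling means each value is paired with the latest reading from the other series that was already available at that instant.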
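Strip seconds out of all your datetimes. The best way to do this is to use rename with a function that sets the seconds of each timestamp to zero. Note that if rename causes there to be duplicate dates an Exception will be thrown.
For something a little more advanced, suppose you wanted to correlate the mean value for each minute (where you have multiple observations per second). In that case, group each series by its timestamps with the seconds set to zero, take the mean of each group, and correlate the two resulting series.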
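A rough sketch of the per-minute version follows. The data and variable names are invented for illustration, and it uses DatetimeIndex.floor from current pandas to drop the seconds rather than rename, so treat it as one possible way to do it rather than the only one:

```python
import pandas as pd

# Hypothetical data: a reading every 20 seconds (three per minute),
# with series_b offset from series_a by 5 seconds.
idx_a = pd.date_range("2012-01-01 00:00:00", periods=12,
                      freq=pd.Timedelta(seconds=20))
idx_b = idx_a + pd.Timedelta(seconds=5)

series_a = pd.Series([float(i) for i in range(12)], index=idx_a)
series_b = pd.Series([2.0 * i + 1.0 for i in range(12)], index=idx_b)

# Round each timestamp down to the start of its minute, then average
# every reading that falls within the same minute.
mean_a = series_a.groupby(series_a.index.floor("min")).mean()
mean_b = series_b.groupby(series_b.index.floor("min")).mean()

# The per-minute means share the same index, so corr can pair them.
minute_corr = mean_a.corr(mean_b)
print(minute_corr)
```

Grouping before correlating sidesteps the duplicate-date problem, because all readings that land on the same minute are collapsed into a single mean.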
These last approaches may not work if you don’t have the latest code from https://github.com/wesm/pandas. If .mean() doesn’t work on a GroupBy object per above, try .agg(np.mean).
Hope this helps!