I’m running daily simulations in a batch: I do 365 simulations to get results for a full year. After every run, I want to extract some arrays from the results and add them to a pandas.DataFrame for later analysis.
I have a rough model (doing an optimisation) and a more precise model for a post-simulation, so I can get the same variable from two sources. If the post-simulation has been run, its results may overwrite the optimisation results.
To make it more complicated, the optimisation model has a smaller output interval (depending on the discretisation settings), but the final analysis will happen on the larger interval of the post-simulation.
What is the best way to construct this DataFrame?
This was my first approach:
1. creation of an empty DataFrame df for the full year, with DateRange index with the larger post-simulation interval (=15 minutes)
2. do optimization for 1 day ==> create temporary df_temp with DateRange as index with smaller interval
3. downsample this DataFrame to 15 minutes as described here:
4. update df with df_temp (rows in df are still empty, except for the last row of the previous run, so I have to take df_temp[1:])
5. do simulation for same day ==> create temporary df_temp2 with interval = 15min
6. overwrite the corresponding rows in df with df_temp2
Which methods should I use in step 4) and 6)? Or is there a better way from the start?
Thanks,
Roel
I think that using DataFrame.combine_first could be the way to go, but depending on the scale of the data, it might be more useful to have a method like “update” that just modified particular rows in an existing DataFrame. combine_first is more general and can cause the result to be of a different size than either of the inputs (because the indexes will get unioned together).
https://github.com/pydata/pandas/issues/961
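
For what it's worth, recent pandas versions do have DataFrame.update, so steps 4 and 6 could be sketched roughly as below. This is only an illustration under assumptions: the column name power, the 5-minute optimisation interval, the dummy values and the choice of mean for downsampling are all made up, and pd.date_range stands in for the older DateRange.

```python
import numpy as np
import pandas as pd

# Step 1: empty frame on the 15-minute grid (two days here instead of a full year).
year_index = pd.date_range("2012-01-01", "2012-01-03", freq="15min")
df = pd.DataFrame({"power": np.nan}, index=year_index)

# Step 2: optimisation output for day 1 on a finer grid (5 minutes is an assumption).
opt_index = pd.date_range("2012-01-01", "2012-01-02", freq="5min")
df_temp = pd.DataFrame(
    {"power": np.arange(len(opt_index), dtype=float)}, index=opt_index
)

# Step 3: downsample to 15 minutes. The mean is an assumed aggregation;
# another one (first, sum, ...) may suit the variable better.
df_temp_15 = df_temp.resample("15min").mean()

# Step 4: write the optimisation results into df in place, skipping the first
# row because it belongs to the previous run.
df.update(df_temp_15.iloc[1:])
after_opt = df.copy()  # kept only to inspect the intermediate state

# Step 5: post-simulation output for the same day, already on the 15-minute grid.
sim_index = pd.date_range("2012-01-01", "2012-01-02", freq="15min")
df_temp2 = pd.DataFrame({"power": 100.0}, index=sim_index)

# Step 6: overwrite the matching rows; df keeps its original index and size.
df.update(df_temp2)

# For contrast: combine_first returns a new frame on the union of both indexes,
# so it can have more rows than either input.
combined = df_temp.combine_first(df)
```

Two things to keep in mind: update ignores NaN values in the frame passed to it, so a missing post-simulation value would leave the optimisation value in place (assigning with df.loc[df_temp2.index] = df_temp2 would overwrite unconditionally), and combine_first gives priority to the calling frame, filling only its gaps from the argument.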