I have a dataset with longitudinal data in a person-oriented format, as such:
pid varA_1 varB_1 varA_2 varB_2 varA_3 varB_3 ...
1 1 1 0 3 2 1
2 0 1 0 2 2 1
...
50k 1 0 1 3 1 0
This results in a large data frame, with at least 50k observations and 90 variables, each measured for up to 29 periods.
I would like to get a more period-oriented format, as such:
pid index start stop varA varB varC ...
1 1 ...
1 2
...
1 29
2 1
I have tried different approaches for reshaping the data frame (*apply, plyr, reshape2, loops, appending vs. prefilling all-numeric matrices, etc.), but I cannot get a decent processing time (40+ minutes even for subsets). I have picked up various hints along the way on what to avoid, but I’m still not sure whether I am missing a bottleneck or a possible speedup.
Is there an optimal way to approach this kind of data processing, so that I can evaluate the best-case processing time I can achieve in pure R code? There have been similar questions on Stack Overflow, but they did not result in convincing answers…
First, let’s build the data example (I am using 5e3 instead of 50e3 to avoid memory problems with my configuration):
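The original code for this step is not shown, so the following is only a minimal sketch of how such an example could be built. The variable names (var1 to var90), the value range 0:3, and the seed are assumptions made for illustration; only the sizes (5e3 persons, 90 variables, 29 periods) come from the text.

```r
set.seed(1)        # assumption: any seed works, fixed here for reproducibility
n_pid     <- 5e3   # persons (50e3 in the real data)
n_vars    <- 90    # variables per period, named var1..var90 (hypothetical names)
n_periods <- 29    # measurement periods

# Column names in the question's layout: var1_1, var2_1, ..., var90_29
col_names <- paste0("var", rep(seq_len(n_vars), times = n_periods), "_",
                    rep(seq_len(n_periods), each = n_vars))

# Fill one integer matrix in a single call instead of growing a data frame
vals <- matrix(sample(0:3, n_pid * n_vars * n_periods, replace = TRUE),
               nrow = n_pid, dimnames = list(NULL, col_names))

# Prepend the person id; the result is the wide, person-oriented format
wide <- data.frame(pid = seq_len(n_pid), vals)
dim(wide)
```

The wide data frame has one row per person and one column per variable-period pair plus pid.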
And now with stats::reshape you change the format:
I am not sure if this is the fast solution you are looking for.
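The reshape call itself is missing from the text above. A minimal sketch of what a stats::reshape call for this layout might look like is given here, on a tiny hand-made data set so that it runs quickly. The data values mirror the question's first rows where possible; the third row and the choice of "index" as the time column name are assumptions.

```r
# Tiny wide data set in the question's layout (3 persons, 2 variables, 3 periods)
wide <- data.frame(
  pid    = 1:3,
  varA_1 = c(1, 0, 1), varB_1 = c(1, 1, 0),
  varA_2 = c(0, 0, 1), varB_2 = c(3, 2, 3),
  varA_3 = c(2, 2, 1), varB_3 = c(1, 1, 0)
)

long <- reshape(
  wide,
  direction = "long",
  idvar     = "pid",
  varying   = names(wide)[-1],  # every column except pid
  sep       = "_",              # split "varA_1" into name "varA" and period 1
  timevar   = "index"           # name of the new period column
)

# reshape() stacks period by period; sort by person, then period
long <- long[order(long$pid, long$index), ]
rownames(long) <- NULL
long
```

This relies on reshape() guessing the variable names and periods from the column names, which assumes every time-varying column contains exactly one "_" separator. Timing it with system.time() on the full-size data would show whether it is fast enough for the 50k case.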