According to Creating an R dataframe row-by-row, it’s not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R resulting in a data.frame without incurring this penalty? The intermediate format doesn’t need to be a data.frame.
First approach
I tried accessing each element of a pre-allocated data.frame:
But tracemem goes crazy (i.e. the data.frame is being copied to a new address each time).
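A minimal sketch of the kind of test described above, assuming a tiny pre-allocated data.frame. The size `n` and the column names `x` and `y` are hypothetical and chosen only for illustration; this is not the original code from the question.

```r
# Hypothetical size and column names, for illustration only.
n <- 5
df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)

# tracemem() needs an R build with memory profiling enabled, so guard the call.
if (capabilities("profmem")) tracemem(df)

for (i in seq_len(n)) {
  df[i, "x"] <- i * 1.5      # element-wise assignment into the pre-allocated frame
  df[i, "y"] <- letters[i]
}

if (capabilities("profmem")) untracemem(df)
print(df)
```

On a build with memory profiling, each tracemem message printed during the loop corresponds to a copy of the data.frame.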
Alternative approach (doesn’t work either)
One approach (not sure it’s faster as I haven’t benchmarked yet) is to create a list of data.frames, then stack them all together:
Unfortunately, in creating the list I think you will be hard-pressed to pre-allocate. For instance:
In other words, replacing an element of the list causes the list to be copied. I assume the whole list, but it’s possible it’s only that element of the list. I’m not intimately familiar with the details of R’s memory management.
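The list-based idea can be sketched roughly as follows, using only base R. The list length `n` and the columns `x` and `y` are assumptions made for the example, and `do.call(rbind, ...)` stands in for whatever stacking function was originally used.

```r
# Hypothetical size and columns, for illustration only.
n <- 5
rows <- vector("list", n)    # list of length n, filled with NULL placeholders

for (i in seq_len(n)) {
  # Each iteration builds a one-row data.frame and stores it in its slot.
  rows[[i]] <- data.frame(x = i, y = letters[i], stringsAsFactors = FALSE)
}

# Combine everything with a single rbind call at the end,
# instead of growing a data.frame inside the loop.
result <- do.call(rbind, rows)
print(result)
```

If data.table is installed, `data.table::rbindlist(rows)` is a commonly used alternative for the final stacking step.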
Probably the best approach
As with many speed or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign-by-reference operator, it can update without re-copying:
But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:
(Results shown below)
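As a rough illustration of filling a pre-allocated data.table with set() inside a loop, here is a small sketch. It assumes the data.table package is installed; the size `n` and the columns `x` and `y` are hypothetical, not taken from the original answer.

```r
library(data.table)

# Hypothetical size and columns, for illustration only.
n <- 5
dt <- data.table(x = numeric(n), y = character(n))

for (k in seq_len(n)) {
  # set() modifies dt by reference: row k, named column, new value.
  set(dt, i = k, j = "x", value = k * 1.5)
  set(dt, i = k, j = "y", value = letters[k])
}

print(dt)
```

The same update written with := would be `dt[k, x := k * 1.5]`, which also assigns by reference but goes through the overhead of a `[.data.table` call on every iteration.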
Benchmarking
With the loop run 10,000 times, data.table is almost a full order of magnitude faster:
And comparison of := with set():
Note that n here is 10^6, not 10^5 as in the benchmarks plotted above. So there’s an order of magnitude more work, and the result is measured in milliseconds, not seconds. Impressive indeed.
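For anyone wanting to try a similar comparison locally, the following is one possible way to time the two loops with base R's system.time(). It is only a sketch: the value of `n` is deliberately much smaller than the 10^6 mentioned above, the column name `x` is hypothetical, and the timings will vary by machine.

```r
library(data.table)

n <- 1000L                      # assumed small size so the sketch runs quickly
dt1 <- data.table(x = numeric(n))
dt2 <- data.table(x = numeric(n))

# Loop using := (assignment by reference through [.data.table)
t_assign <- system.time(
  for (k in seq_len(n)) dt1[k, x := k * 2]
)

# Loop using set() (assignment by reference without the [.data.table overhead)
t_set <- system.time(
  for (k in seq_len(n)) set(dt2, i = k, j = "x", value = k * 2)
)

# Elapsed seconds for each variant
print(c(`:=` = t_assign[["elapsed"]], set = t_set[["elapsed"]]))
```

Both loops should leave identical contents in the two tables; only the elapsed times differ.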