I have two very large data frames (50MM+ rows) and I need to run some calculations on them. I have developed the following loop, but it runs too slowly. I tried using apply and other methods, but I couldn’t get them to work.
#### Sample Data
df=data.frame(id=1:10,time=Sys.time()-1:10,within5=NA)
df2=data.frame(id2=c(1,1,1,5,5,10),time2=Sys.time()-c(9,5,2,3,4,6))
#### Loop shows how many results from df2 are within 5 secs of the creation of the ID in df
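for (i in seq_len(nrow(df)))
{
  # rows of df2 that belong to the current id (the column in df2 is id2, not id)
  temp=df2[df2$id2==df$id[i],]
  # count how many of those rows fall within 5 seconds of df$time[i]
  df$within5[i]=sum(abs(as.numeric(difftime(temp$time2,df$time[i],units="secs")))<5)
}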
To measure the improvement, I made a larger sample data set.
Use the function ddply() from the package plyr to split your data by the column id2, then apply your function to each subset. The result is a new data frame.
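The split-and-apply step described above could look roughly like the following. This is a minimal sketch, not necessarily the exact code used for the timing: it assumes plyr is installed, uses the small sample data from the question, and replaces Sys.time() with a fixed reference time (the variable now, an assumption) so that the result is reproducible.

```r
library(plyr)

# Fixed reference time (an assumption for reproducibility; the question
# calls Sys.time() separately for each data frame).
now <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")

df  <- data.frame(id = 1:10, time = now - 1:10)
df2 <- data.frame(id2 = c(1, 1, 1, 5, 5, 10),
                  time2 = now - c(9, 5, 2, 3, 4, 6))

# Split df2 by id2. For each subset, look up the creation time of that id
# in df and count the rows of the subset that lie within 5 seconds of it.
counts <- ddply(df2, .(id2), function(x) {
  created <- df$time[df$id == x$id2[1]]
  secs <- abs(as.numeric(difftime(x$time2, created, units = "secs")))
  data.frame(within5 = sum(secs < 5))
})
```

Here counts has one row per distinct id2 with the columns id2 and within5. Ids that never occur in df2 do not appear in it at all, which is why a separate step is needed to get the counts back into df.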
If you need the column within5 in your original data frame, you can use the function merge(). With my sample data this calculation was 10 times faster.
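A sketch of the merge() step, under the assumption that the per-id counts are already available in a data frame with the columns id2 and within5. The counts data frame below is hypothetical and typed in by hand so that the example runs on its own.

```r
# Original data frame, without the NA placeholder column within5.
df <- data.frame(id = 1:10)

# Hypothetical per-id counts, shaped like the result of the ddply() step.
counts <- data.frame(id2 = c(1, 5, 10), within5 = c(2L, 2L, 1L))

# Left join: keep every row of df and match df$id against counts$id2.
merged <- merge(df, counts, by.x = "id", by.y = "id2", all.x = TRUE)

# Ids with no rows in counts come back as NA; the loop in the question
# gives 0 for them, so fill those in to get the same result.
merged$within5[is.na(merged$within5)] <- 0L
```

If df still carries the within5 = NA placeholder column from the question, drop it before merging. Otherwise merge() keeps both columns and renames them within5.x and within5.y.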