I have two very large data frames (50MM+ rows) and I need to run

Asked: June 17, 2026

I have two very large data frames (50MM+ rows) and I need to run some calculations on them. I have developed the following loop, but it runs too slowly. I tried using apply and other methods, but I couldn’t get them to work.

#### Sample Data
df <- data.frame(id = 1:10, time = Sys.time() - 1:10, within5 = NA)
df2 <- data.frame(id2 = c(1, 1, 1, 5, 5, 10), time2 = Sys.time() - c(9, 5, 2, 3, 4, 6))

#### Loop counts how many rows of df2 are within 5 secs of the creation of each ID in df
for (i in seq_len(nrow(df)))
{
  # The column in df2 is id2; df2$id only worked through partial matching of `$`
  temp <- df2[df2$id2 == df$id[i], ]
  df$within5[i] <- sum(abs(as.numeric(difftime(temp$time2, df$time[i], units = "secs"))) < 5)
}

1 Answer

Editorial Team · Answer 1 · 2026-06-17T08:56:48+00:00

To measure the improvement, I made a larger sample data set.

df <- data.frame(id = 1:100, time = Sys.time() - 1:100)
df2 <- data.frame(id2 = sample(1:100, 300000, replace = TRUE), time2 = Sys.time() - sample(1:5, 300000, replace = TRUE))

Use the function ddply() from the plyr package to split df2 by the column id2, then apply your calculation to each subset. This replaces the repeated subsetting of df2 inside the loop with a single split.

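library(plyr)
# For each id2 group, count the rows whose time2 is within 5 secs of that id's time in df
df3 <- ddply(df2, "id2", function(x) {
    ref_time <- df$time[df$id == x$id2[1]]
    data.frame(within5 = sum(abs(as.numeric(difftime(x$time2, ref_time, units = "secs"))) < 5))
})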

The result is a new data frame with one row per value of id2.

 head(df3)
  id2 within5
1   1    3129
2   2    3032
3   3    2935
4   4    3121
5   5    3042
6   6    2426

If you need the column within5 in your original data frame, you can add it with the function merge().

df4 <- merge(df, df3, by.x = "id", by.y = "id2", all = TRUE)
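One detail worth checking after the merge: an id in df that has no rows at all in df2 comes back with within5 set to NA, whereas the loop in the question would report 0 for it. The sketch below shows one way to handle that. The tiny df and df3 are made-up stand-ins (not the data above), and all.x = TRUE is used instead of all = TRUE on the assumption that only the rows of df are wanted.

```r
# Illustrative stand-ins: five ids, of which only 1 and 5 have counts in df3
df  <- data.frame(id = 1:5)
df3 <- data.frame(id2 = c(1, 5), within5 = c(2, 2))

# all.x = TRUE keeps every row of df, including ids that never occur in df3
df4 <- merge(df, df3, by.x = "id", by.y = "id2", all.x = TRUE)

# Unmatched ids are NA after the merge; set them to 0 to match the loop's output
df4$within5[is.na(df4$within5)] <- 0

print(df4)
```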

With my sample data, this calculation was 10 times faster than the loop.
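To check a speed-up like this on your own data, each version can be wrapped in system.time(). The following is only a sketch under a few assumptions: it times the question's loop against a vectorized base-R variant built on match() and tabulate() (swapped in for ddply() so the example runs without plyr), it uses a fixed reference time and a smaller df2 than above, and the helper names count_loop and count_vec are made up for illustration. Timings will vary by machine, so none are quoted here.

```r
# Fixed reference time so both versions see identical timestamps (arbitrary value)
now <- as.POSIXct("2026-06-17 08:00:00", tz = "UTC")
set.seed(42)  # arbitrary seed, only for repeatability
df  <- data.frame(id = 1:100, time = now - 1:100)
df2 <- data.frame(id2   = sample(1:100, 30000, replace = TRUE),
                  time2 = now - sample(1:5, 30000, replace = TRUE))

# Version 1: the row-by-row loop from the question, wrapped in a function
count_loop <- function(df, df2) {
  out <- integer(nrow(df))
  for (i in seq_len(nrow(df))) {
    temp <- df2[df2$id2 == df$id[i], ]
    out[i] <- sum(abs(as.numeric(difftime(temp$time2, df$time[i], units = "secs"))) < 5)
  }
  out
}

# Version 2: vectorized; look up the reference time of every df2 row once
count_vec <- function(df, df2) {
  idx <- match(df2$id2, df$id)                                   # row of df for each df2 row
  gap <- abs(as.numeric(df2$time2) - as.numeric(df$time[idx]))   # difference in seconds
  tabulate(idx[which(gap < 5)], nbins = nrow(df))                # hits per df row, 0 if none
}

t_loop <- system.time(res_loop <- count_loop(df, df2))
t_vec  <- system.time(res_vec  <- count_vec(df, df2))

# Both versions must agree before the timings mean anything
stopifnot(identical(res_loop, res_vec))
print(c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```

Because tabulate() returns one count per row of df, including zeros, the result can be assigned directly with df$within5 <- count_vec(df, df2) and no merge step is needed.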
