Another novice question regarding big data. I’m working with a large dataset (3.5m rows)


Asked: June 18, 2026 (2026-06-18T06:04:09+00:00)


Another novice question regarding big data. I’m working with a large dataset (3.5m rows) with time series data. I want to create a data.table with a column that finds the first time the unique identifier appears.

df is a data.table, df$timestamp is a date in class POSIXct, and df$id is the unique numeric identifier. I’m using the following code:

# UPDATED - DATA KEYED
setkey(df, id)
sub_df<-df[,(min(timestamp)), by=list(id)] # Finding first timestamp for each unique ID

Here’s the catch: I’m aggregating over 80k unique IDs, and R is choking. Is there anything I can do to optimize my approach?


1 Answer

Editorial Team · Answer 1 · 2026-06-18T06:04:11+00:00

As mentioned by @Arun, the real key (no pun intended) is the use of proper data.table syntax rather than setkey.

df[, min(timestamp), by=id]
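As a minimal sketch of that grouped form, here it is on toy data. The table `df_demo`, its values, and the names `first_by_id` and `first_seen` are made up for illustration; only the `id` and `timestamp` columns come from the question. Naming the result avoids the default `V1` column, and `:=` is one way to attach the value to every row instead of collapsing to one row per id.

```r
library(data.table)

# Toy stand-in for the real table; values are illustrative only
df_demo <- data.table(
  id = c(2, 1, 2, 1, 3),
  timestamp = as.POSIXct(
    c("2020-01-03", "2020-01-05", "2020-01-01", "2020-01-02", "2020-01-04"),
    tz = "UTC"
  )
)

# One row per id holding its earliest timestamp
first_by_id <- df_demo[, .(first_seen = min(timestamp)), by = id]

# Alternatively, add the earliest timestamp as a new column on every row
df_demo[, first_seen := min(timestamp), by = id]
```

The first form returns a new, smaller table; the second modifies `df_demo` in place, which may be closer to "a column that finds the first time the unique identifier appears".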

While 80k unique ids sound like a lot, keying the data.table can make the job manageable.

setkey(df, id)

Then process as before. For what it's worth, you can often exploit a pleasant side effect of setting a key: the table is sorted by the key columns.

set.seed(1)
dat <- data.table(x = sample(1:10, 10), y = c('a', 'b'))

    x y
 1:  3 a
 2:  4 b
 3:  5 a
 4:  7 b
 5:  2 a
 6:  8 b
 7:  9 a
 8:  6 b
 9: 10 a
10:  1 b

setkey(dat, y, x)

     x y
 1:  2 a
 2:  3 a
 3:  5 a
 4:  9 a
 5: 10 a
 6:  1 b
 7:  4 b
 8:  6 b
 9:  7 b
10:  8 b

Because each y group is now sorted by x, the min (or a more complex per-group result) is just a subset operation that takes the first row of each group:

dat[, .SD[1], by=y]
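Applied to the original question, the same idea might look like the sketch below. It assumes a table with `id` and `timestamp` columns as described in the question; the toy data, the extra `value` column, and the names `df_demo`, `first_rows`, and `first_rows_alt` are hypothetical. After keying on both columns, the first row of each `id` group is its earliest observation.

```r
library(data.table)

# Toy stand-in for the real table; values are illustrative only
df_demo <- data.table(
  id = c(2, 1, 2, 1, 3),
  timestamp = as.POSIXct(
    c("2020-01-03", "2020-01-05", "2020-01-01", "2020-01-02", "2020-01-04"),
    tz = "UTC"
  ),
  value = c(10, 20, 30, 40, 50)
)

# Sort by id, then by timestamp within each id
setkey(df_demo, id, timestamp)

# The first row of each group is now the earliest observation
first_rows <- df_demo[, .SD[1], by = id]

# unique() with by= also keeps the first row per id, without using .SD
first_rows_alt <- unique(df_demo, by = "id")
```

Unlike the plain `min(timestamp)` aggregate, this keeps the other columns of the earliest row, which can be useful if more than the timestamp is needed.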
