Another novice question regarding big data. I’m working with a large time series dataset (3.5m rows). I want to create a data.table with a column holding the first time each unique identifier appears.
df is a data.table, df$timestamp is a date in class POSIXct, and df$id is the unique numeric identifier. I’m using the following code:
# UPDATED - DATA KEYED
setkey(df, id)
sub_df <- df[, min(timestamp), by = id] # Finding first timestamp for each unique ID
Here’s the catch. I’m aggregating over 80k unique IDs. R is choking. Is there anything I can do to optimize my approach?
As mentioned by @Arun, the real key (no pun intended) is the use of proper data.table syntax rather than setkey.
While 80k unique ids sounds like a lot, using the key feature of data.table can make it a manageable prospect. Set the key with setkey(df, id), then process as before.
For what it's worth, you can often use a pleasant side effect of keys, which is sorting.
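A minimal sketch of that sorting side effect, using a small made-up table (the names dat, x, and y are illustrative and not from the question):

```r
library(data.table)

set.seed(1)
# Toy data: x is a shuffled 1..10, y alternates between two groups
dat <- data.table(x = sample(1:10, 10), y = rep(c("a", "b"), 5))

# Keying by y, then x, physically reorders the rows by those columns
setkey(dat, y, x)

key(dat)             # the key columns, in order
is.unsorted(dat$y)   # FALSE once the table is keyed: groups are contiguous
```

After setkey, printing dat should show all the "a" rows first, with x ascending inside each group.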
Then min, or another more complex function, is just a subset operation: with the rows sorted, the first element of each group is its minimum.
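Applied to the question's columns, the idea might look like the sketch below. The df here is a hypothetical five-row stand-in for the real table, and the column name first_timestamp is an assumption added for readability. Keying on both id and timestamp sorts the timestamps within each id, so taking the first element per group should match the grouped minimum:

```r
library(data.table)

# Hypothetical stand-in for the real 3.5m-row table
df <- data.table(
  id = c(2, 1, 2, 1, 3),
  timestamp = as.POSIXct(
    c("2013-01-05 10:00:00", "2013-01-03 09:00:00", "2013-01-02 08:00:00",
      "2013-01-04 12:00:00", "2013-01-01 07:00:00"),
    tz = "UTC"
  )
)

# Sort by id, then by timestamp within each id
setkey(df, id, timestamp)

# With the rows sorted, the first timestamp in each group is the earliest
first_seen <- df[, .(first_timestamp = timestamp[1]), by = id]

# The direct grouped minimum, for comparison
by_min <- df[, .(first_timestamp = min(timestamp)), by = id]
```

Whether the keyed subset is actually faster than min on 80k groups depends on the data and the data.table version, so it is worth timing both on the real table.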