I am doing the classic split-apply-recombine thing in R. My data set is a bunch of firms over time. The apply step runs a regression for each firm and returns the residuals, so I am not aggregating by firm. plyr is great for this, but it takes a very long time to run when the number of firms is large. Is there a way to do this with data.table?
Sample Data:
dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21
I need to split by id (here 1 and 2), run a regression for each, return the residuals, and append them as a column to my data. Is there a way to do this using data.table?
I’m guessing this needs to be sorted by “id” to line up properly. Luckily that happens automatically when you set the key:
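A minimal sketch of that approach, using the sample data from the question. The object names dat, dt, res and out are only illustrative, and the model val1 ~ val2 is an assumption about which regression is wanted:

```r
library(data.table)

# Sample data from the question
dat <- data.frame(
  dte  = as.Date(c("2001-10-02", "2001-10-03", "2001-10-04",
                   "2001-10-02", "2001-10-03")),
  id   = c(1, 1, 1, 2, 2),
  val1 = c(10, 11, 12, 13, 14),
  val2 = c(25, 24, 23, 22, 21)
)

# Setting the key sorts the table by id
dt <- data.table(dat, key = "id")

# One regression per id; keep only the residuals (one per row of the group)
res <- dt[, list(resid = residuals(lm(val1 ~ val2))), by = id]

# Both tables are ordered by id, so the residuals line up row for row
out <- cbind(dt, resid = res$resid)
```

One thing to watch: if val1 or val2 contain missing values, lm() drops those rows by default and the residual vector comes back shorter than the group. Passing na.action = na.exclude to lm() is one way to keep the lengths aligned.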
EDIT from Matthew:
This is all correct for v1.8.0 on CRAN, with the small addition that transform in j is the subject of data.table wiki point 2: “For speed don’t transform() by group, cbind() afterwards”. But := now works by group in v1.8.1 and is both simple and fast. See my answer for illustration (but no need to vote for it).
Well, I voted for it. Here is the console command to install v1.8.1 on a Mac (if you have the proper Xcode tools available, since it is only there in source):
(For some reason I could not get the Mac GUI Package Installer to read r-forge as a repository.)
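For reference, the := by group approach mentioned in the edit might look something like the sketch below. It assumes a data.table version where := works with by (v1.8.1 or later, per the edit); the column name resid and the model val1 ~ val2 are illustrative choices, not taken from Matthew's answer:

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  dte  = as.Date(c("2001-10-02", "2001-10-03", "2001-10-04",
                   "2001-10-02", "2001-10-03")),
  id   = c(1, 1, 1, 2, 2),
  val1 = c(10, 11, 12, 13, 14),
  val2 = c(25, 24, 23, 22, 21)
)

# Fit one regression per id and add its residuals as a new column, by reference.
# Each group's residuals are assigned to that group's own rows.
dt[, resid := residuals(lm(val1 ~ val2)), by = id]
```

Because the assignment happens in place on each group's rows, there is no separate cbind() step and no need to sort first to make things line up.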