I have a large data set:
dim(dt)
[1] 422096 162
where dt is a data.table keyed on tic. For each group, I am trying to build a measure of how many entries are missing. Each group is a time series; dt contains a date column, which is an R Date, and a book_lev column, my variable of interest.
This is my code so far:
sumdt <- dt[ ,list(min.date=min(date), max.date=max(date)), by="tic"]
dt <- dt[sumdt]
sublengths <- dt[,list(tslen=length(date)),by=tic, mult="last"]
bt2 <- dt[sublengths, mult="first"]
bt2[, max.year:=extractyear(max.date)]
bt2[, min.year:=extractyear(min.date)]
bt2[, data.fullness:=tslen/(max.year - min.year + 1)]
dt <- dt[bt2]
My idea was to create this data.fullness value, which should equal 1 if there are no holes in the time series. I realize that I may have some NAs in my book_lev column, so I would like to restrict the measure further to account for those. I am also new to data.table in general, and would like to know whether there are better ways to write what I have just written.
A small sample of the data, which you can load using R’s load command, is available here: http://econsteve.com/r/dt_sample.Robj
(First, a caveat. I’m not sure I correctly understood what you want your data.fullness variable to summarize. Based on the dataset you’ve linked to, I’m taking it to be the proportion of years with some data, in the interval from the first measured year to the last measured year.)
Here is the approach I’d take to the problem as I do understand it:
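In outline, that interpretation can be sketched on a small made-up table. The values below are hypothetical, not taken from the linked sample, and the year extraction is a stand-in for the question's extractyear(). This version counts only years with a non-missing book_lev, which is my assumption about what the further restriction should be:

```r
library(data.table)

# Hypothetical toy data: tic A is complete, tic B has a missing year and an NA.
dt <- data.table(
  tic      = c("A", "A", "A", "B", "B", "B"),
  date     = as.Date(c("2000-06-30", "2001-06-30", "2002-06-30",
                       "2005-06-30", "2006-06-30", "2008-06-30")),
  book_lev = c(0.5, 0.6, 0.4, 0.3, NA, 0.1)
)
setkey(dt, tic)

# Calendar year of each observation (stand-in for extractyear()).
dt[, year := as.integer(format(date, "%Y"))]

# Per tic: distinct years with a non-NA book_lev, divided by the number of
# years spanned by those observations. Assumption: the first and last year
# are taken from the non-NA rows only.
fullness <- dt[!is.na(book_lev),
               list(data.fullness = length(unique(year)) /
                                    (max(year) - min(year) + 1)),
               by = tic]

# Attach the per-group value to every row with a keyed join.
setkey(fullness, tic)
dt <- fullness[dt]
```

On this toy table, tic A has a value in each of its three years, so its data.fullness is 1, while tic B has a usable value in only two of the four years from 2005 to 2008, giving 0.5.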