I have a large data set dim(dt) [1] 422096 162 where dt is a

Asked: May 27, 2026

I have a large data set

 dim(dt)
 [1] 422096    162

where dt is a data.table keyed on tic. For each group, I am trying to compute a measure of how many entries are missing. The groups are time series, and dt contains a date column, which is an R Date, and a book_lev column, my variable of interest.

This is my code so far:

sumdt <- dt[ ,list(min.date=min(date), max.date=max(date)), by="tic"]
dt <- dt[sumdt]

sublengths <- dt[,list(tslen=length(date)),by=tic, mult="last"]
bt2 <- dt[sublengths, mult="first"]
bt2[, max.year:=extractyear(max.date)]
bt2[, min.year:=extractyear(min.date)]
bt2[, data.fullness:=tslen/(max.year - min.year + 1)]

dt <- dt[bt2]

My idea was to create this data.fullness value, which should equal 1 if there are no holes in the time series. I realize that I may have some NAs in my book_lev column, so I would like to restrict the measure further to entries where book_lev is not missing. Also, I am new to data.table in general, and I would like to see whether there are better ways to write what I have just written.

A small sample of the data, which you can load using R’s load command, is available here: http://econsteve.com/r/dt_sample.Robj

1 Answer

Editorial Team · Answer 1 · 2026-05-27T11:58:08+00:00

(First, a caveat. I’m not sure I correctly understood what you want your data.fullness variable to summarize. Based on the dataset you’ve linked to, I’m taking it to be the proportion of years with some data, in the interval from the first measured year to the last measured year.)
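To make that definition concrete, here is a small worked example in base R. The years are made up for illustration and do not come from the linked dataset.

```r
# Hypothetical years observed for one ticker; 2002 is missing.
yrs <- c(2000, 2001, 2003, 2004)

n_observed <- length(unique(yrs))     # distinct years with some data
n_spanned  <- diff(range(yrs)) + 1    # years from first to last, inclusive
fullness   <- n_observed / n_spanned  # 4 observed out of 5 spanned

fullness
```

A series with no gaps gives exactly 1, and each missing year inside the span lowers the value.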

Here is the approach I’d take to the problem as I do understand it:

## FIRST, DEFINE A COUPLE OF FUNCTIONS

extractYear <- function(X) {
    as.numeric(format(as.Date(X, format="%m/%d/%Y"), "%Y"))
}

calcFullness <- function(YRS) {
    length(unique(YRS))/(diff(range(YRS))+1)
}

## THEN SET TO WORK ON YOUR DATA.TABLE

setkey(dt, tic)
dt[, year:=extractYear(datadate)]

# Extract summaries for each level of tic
ticSumm <- 
    dt[, list(min.year = min(year),
              max.year = max(year),
              data.fullness = calcFullness(year)), by=tic]
ticSumm
#       tic min.year max.year data.fullness
# [1,] AMZN     1995     2010             1
# [2,]   GM     1950     2010             1
# [3,]  XOM     1950     2010             1


# Merge summary back into dt
dt <- dt[ticSumm]
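
The question also asks about restricting the measure to rows where book_lev is not missing. One way this might be done is sketched below on made-up toy data. It assumes the columns are named tic, datadate and book_lev, and it measures the span from the first to the last year with a non-missing book_lev, which is only one possible reading of the requirement. The helper fullnessNonMissing is a hypothetical name, not part of data.table.

```r
library(data.table)

# Toy data; column names are assumed to match the question's table,
# and the values are invented for illustration.
toy <- data.table(
  tic      = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  datadate = as.Date(c("2000-12-31", "2001-12-31", "2002-12-31", "2004-12-31",
                       "2001-12-31", "2002-12-31", "2003-12-31")),
  book_lev = c(0.4, NA, 0.5, 0.6, 0.2, 0.3, 0.1)
)
toy[, year := as.integer(format(datadate, "%Y"))]

# Hypothetical helper: fullness over years with a non-missing value.
# Returns NA if a group has no usable observations at all.
fullnessNonMissing <- function(yrs, x) {
  ok <- yrs[!is.na(x)]
  if (length(ok) == 0L) return(NA_real_)
  length(unique(ok)) / (diff(range(ok)) + 1)
}

# := with by= writes the per-group value onto every row by reference,
# so no separate summary table or join is needed.
toy[, data.fullness := fullnessNonMissing(year, book_lev), by = tic]
toy
```

Here AAA has usable values in 2000, 2002 and 2004, so its fullness is 3/5, while BBB has a complete run and gets 1. If the denominator should instead cover every year in which the ticker appears at all, the range would need to be taken over all of year rather than only the non-missing rows.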
