I have a large data.frame that was generated by a process outside my control,

Question

0

Asked: May 28, 20262026-05-28T03:17:50+00:00 2026-05-28T03:17:50+00:00

I have a large data.frame that was generated by a process outside my control,

0

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.

Here’s the function I’m currently using to remove such variables from the data.frame. It’s currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?

set.seed(1)
dat <- data.frame(
    A=factor(rep("X",10),levels=c('X','Y')),
    B=round(runif(10)*10),
    C=rep(10,10),
    D=c(rep(10,9),1),
    E=factor(rep("A",10)),
    F=factor(rep(c("I","J"),5)),
    G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
    out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
    which(out==1)
}

And here’s the result of the process:

> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA

> dat[,-zeroVar(dat)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

> dat[,-zeroVar(dat, useNA = 'no')]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-28T03:17:50+00:00

Don’t use table() – very slow for such things. One option is length(unique(x)):

foo <- function(dat) {
    out <- lapply(dat, function(x) length(unique(x)))
    want <- which(!out > 1)
    unlist(want)
}

system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))

Which is an order magnitude faster than yours on the example data set whilst giving similar output:

> system.time(replicate(1000, zeroVar(dat)))
   user  system elapsed 
  3.334   0.000   3.335 
> system.time(replicate(1000, foo(dat)))
   user  system elapsed 
  0.324   0.000   0.324

Simon’s solution here is similarly quick on this example:

> system.time(replicate(1000, which(!unlist(lapply(dat, 
+             function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
   user  system elapsed 
  0.392   0.000   0.395

but you’ll have to see if they scale similarly to real problem sizes.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have a large data.frame that was generated by a process outside my control,

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply