I have very large dataframe and I need to choose variable number satisfying certain


Asked: May 28, 2026 (2026-05-28T00:49:32+00:00)


I have a very large data frame and need to choose a varying number of variables that satisfy certain criteria for analysis (for example, the variables in a linear model). The following small dataset illustrates my data.

set.seed(1234)
mydf <- data.frame(Id = c("dis", 1:5),
V1.a = c(0, sample(c(0, 1, 2), 5, replace = TRUE)), V1.b = c(0, sample(c(0, 1, 2), 5, replace = TRUE)),
V2.a = c(1.5, sample(c(0, 1, 2), 5, replace = TRUE)), V2.b = c(1.5, sample(c(0, 1, 2), 5, replace = TRUE)),
V3.a = c(2.0, sample(c(0, 1, 2), 5, replace = TRUE)), V3.b = c(2.0, sample(c(0, 1, 2), 5, replace = TRUE)),
V4.a = c(5.0, sample(c(0, 1, 2), 5, replace = TRUE)), V4.b = c(5.0, sample(c(0, 1, 2), 5, replace = TRUE)),
V5.a = c(6.0, sample(c(0, 1, 2), 5, replace = TRUE)), V5.b = c(6.0, sample(c(0, 1, 2), 5, replace = TRUE)),
V16a = c(11.0, sample(c(0, 1, 2), 5, replace = TRUE)), V6.b = c(11.0, sample(c(0, 1, 2), 5, replace = TRUE)),
V7.a = c(12.0, sample(c(0, 1, 2), 5, replace = TRUE)), V7.b = c(12.0, sample(c(0, 1, 2), 5, replace = TRUE)),
V8.a = c(3.0, sample(c(0, 1, 2), 5, replace = TRUE)), V8.b = c(3.0, sample(c(0, 1, 2), 5, replace = TRUE)))

Printed data:

   Id V1.a V1.b V2.a V2.b V3.a V3.b V4.a V4.b V5.a V5.b V6a V6.b V7.a V7.b V8.a V8.b
1 dis    0    0  1.5  1.5    2    2    6   6    7    7   11   11   12   12    3    3
2   1    1    2  1.0  1.0    2    0    0    1    2    1    2    0    0    2    0    2
3   2    1    2  2.0  0.0    2    0    2    1    2    1    0    0    2    1    1    1
4   3    2    0  1.0  2.0    1    1    1    0    2    0    1    2    0    2    1    0
5   4    0    1  1.0  1.0    1    1    0    1    0    2    2    2    2    1    0    2
6   5    1    0  2.0  2.0    0    2    1    2    1    1    1    0    2    2    2    2

Here is what I aim to do:

(1) Sort the columns, except the Id column, by their value in the first row (the row where Id is dis), from smallest to largest.

   Id   V1.a V1.b V2.a V2.b V3.a V3.b V4.a V4.b V5.a V5.b V6a V6.b V7.a V7.b V8.a V8.b
 1 dis    0    0  1.5  1.5    2    2    6   6    7    7   11   11   12   12    3    3

The columns (variables) are already in order except V8.a and V8.b. The sorted data should be in this order:

   Id   V1.a V1.b V2.a V2.b V3.a V3.b V8.a V8.b V4.a V4.b V5.a V5.b V6a V6.b V7.a V7.b
 1 dis    0    0  1.5  1.5    2    2    3    3    6    6    7    7  11   11   12   12

(2) Then calculate the difference between adjacent variables, using their values in the first row. For example, for V1.a and V1.b the difference is 0, and for V1.b and V2.a the difference is 1.5 - 0 = 1.5.

Differences between adjacent variables, based on row 1:

V1.a - V1.b   V1.b - V2.a   V2.a - V2.b   V2.b - V3.a   V3.a - V3.b
   0 - 0        0 - 1.5      1.5 - 1.5      1.5 - 2        2 - 2

and so on …

(3) Keep adding variables to the current model as long as the difference calculated in (2) between adjacent variables is less than 2. Once the difference is greater than 2, start a new model, and continue this way until the end of the data. The first row is not included in the models.

mydf1 =  mydf[-1,]   
model1 <- lm(Id ~ V1.a + V1.b + V2.a + V2.b + V3.a + V3.b + V8.a + V8.b, data = mydf1)
model2 <- lm(Id ~ V4.a + V4.b +  V5.a +  V5.b, data = mydf1)
model3 <- lm(Id ~ V6.a + V6.b + V7.a + V7.b, data = mydf1)

How can I automate this process?

Edits: following the answer

 d = mydf
> d
  Id V1.a V1.b V2.a V2.b V3.a V3.b V4.a V4.b V5.a V5.b V16a V6.b V7.a V7.b V8.a
1  0    0    0  1.5  1.5    2    2    5    5    6    6   11   11   12   12    3
2  1    1    2  1.0  0.0    1    2    1    2    0    1    1    2    0    1    1
3  2    1    1  2.0  2.0    1    2    0    1    0    1    1    1    0    2    0
4  3    0    0  1.0  2.0    2    0    2    2    2    2    2    1    1    2    0
5  4    0    2  0.0  2.0    2    2    1    1    2    2    2    1    2    1    0
6  5    2    1  2.0  0.0    2    0    1    1    0    1    0    0    0    2    1
  V8.b
1    3
2    2
3    0
4    2
5    0
6    2
>  d <- d[,order(d[1,-1])]
> d
  Id V1.a V1.b V2.a V2.b V3.a V7.b V8.a V3.b V4.a V4.b V5.a V5.b V16a V6.b V7.a
1  0    0    0  1.5  1.5    2   12    3    2    5    5    6    6   11   11   12
2  1    1    2  1.0  0.0    1    1    1    2    1    2    0    1    1    2    0
3  2    1    1  2.0  2.0    1    2    0    2    0    1    0    1    1    1    0
4  3    0    0  1.0  2.0    2    2    0    0    2    2    2    2    2    1    1
5  4    0    2  0.0  2.0    2    1    0    2    1    1    2    2    2    1    2
6  5    2    1  2.0  0.0    2    2    1    0    1    1    0    1    0    0    0

Ordering is not working for V7.a !


1 Answer

Editorial Team · Answer 1 · 2026-05-28T00:49:33+00:00

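Order the columns so that the first row is in ascending order, keeping Id as the first column (unlist() turns the one-row data frame into a plain named vector for order()):

d <- d[, c(1, 1 + order(unlist(d[1, -1])))]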
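The attempt in the question, `d[, order(d[1,-1])]`, misplaces columns because `order()` only sees the row without `Id`, so its indices refer to positions in `d[, -1]` rather than in `d`. A minimal sketch of the difference, on a made-up four-column data frame (the column names `A`, `B`, `C` are illustrative only):

```r
# Toy data frame: an Id column followed by three value columns.
d <- data.frame(Id = 0, A = 5, B = 1, C = 3)

# order() sees only the non-Id columns, so its result indexes d[, -1], not d.
idx <- order(unlist(d[1, -1]))   # positions of B, C, A within d[, -1]

wrong <- d[, idx]                # applies those positions to d itself
right <- d[, c(1, 1 + idx)]      # shifts by one and keeps Id in front

names(wrong)   # "A"  "B"  "Id"
names(right)   # "Id" "B"  "C"  "A"
```

Without the shift, every selected column is one position too far to the left, and the last column is dropped in favour of `Id`.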

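Compute the differences d[1,i] - d[1,i-1] between adjacent columns of the first row and store them back in that row (the first variable has no predecessor, so it gets 0):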

d[1,-1] <- c(0, diff( drop(as.matrix(d[1,-1])) ))
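As a quick check of this step, the sketch below applies the same `c(0, diff(...))` idea to a plain vector. The values are the first-row constants from the question's code (with V4 = 5 and V5 = 6), written out already sorted; the vector name `first_row` is just for illustration.

```r
# First-row values from the question's code, in ascending order.
first_row <- c(0, 0, 1.5, 1.5, 2, 2, 3, 3, 5, 5, 6, 6, 11, 11, 12, 12)

# Gap between each value and its left neighbour; the first value gets 0.
gaps <- c(0, diff(first_row))
gaps
# 0 0 1.5 0 0.5 0 1 0 2 0 1 0 5 0 1 0
```

Only the 9th and 13th gaps reach 2, which is where the blocks V4/V5 and V6/V7 begin.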

Group the variables into blocks: add them one at a time, and start a new group if the difference computed at the previous step is 2 or more.

i <- 1+which( d[1,-1] >= 2 )
i <- data.frame( begin=c(2,i), end=c(i-1,dim(d)[2]) )
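To see what the `begin`/`end` table looks like, here is the same computation on a standalone vector of gaps. The gaps are those implied by the question's first-row values once sorted; `gaps`, `n_cols`, `starts`, and `blocks` are illustrative names, not part of the answer's code.

```r
# Gaps between adjacent first-row values (16 variables, Id not included).
gaps <- c(0, 0, 1.5, 0, 0.5, 0, 1, 0, 2, 0, 1, 0, 5, 0, 1, 0)

n_cols <- length(gaps) + 1        # total columns, counting Id as column 1
starts <- 1 + which(gaps >= 2)    # column index in d where a new block begins

# One row per block: first and last column index of the block within d.
blocks <- data.frame(begin = c(2, starts), end = c(starts - 1, n_cols))
blocks
#   begin end
# 1     2   9
# 2    10  13
# 3    14  17
```

Columns 2 to 9 hold V1 to V3 and V8, columns 10 to 13 hold V4 and V5, and columns 14 to 17 hold the remaining four variables, matching the three models sketched in the question.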

Compute the models, in a loop, creating a new data.frame each time:

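models <- list()
for (k in seq_len(nrow(i))) {
  # Id plus the columns of block k, without the first row
  tmp <- d[-1, c(1, i$begin[k]:i$end[k])]
  # Id may be a factor or character (it held "dis"), so convert it to numeric
  tmp$Id <- as.numeric(as.character(tmp$Id))
  models[[k]] <- lm(Id ~ ., data = tmp)
}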
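For reuse, the steps can be wrapped in one function. The sketch below is one possible arrangement, not the answer's own code: the helper name `fit_block_models`, its `id` and `threshold` arguments, and the demo data are all assumptions made for illustration. It groups variable names with `cumsum()` and `split()` instead of a begin/end table, and builds each formula with `reformulate()`.

```r
# Hypothetical helper: sort by first row, split into blocks, fit one lm per block.
fit_block_models <- function(d, id = "Id", threshold = 2) {
  vars <- setdiff(names(d), id)
  key <- unlist(d[1, vars])                  # first-row values, named by column
  key <- key[order(key)]                     # ascending; ties keep original order
  group_id <- cumsum(c(0, diff(key)) >= threshold) + 1
  blocks <- split(names(key), group_id)      # variable names per block

  body <- d[-1, , drop = FALSE]              # the first row is not used for fitting
  body[[id]] <- as.numeric(as.character(body[[id]]))
  lapply(blocks, function(v) lm(reformulate(v, response = id), data = body))
}

# Made-up demo data: six variables, 20 observations, first row holds the sort key.
set.seed(1)
n <- 20
first_row <- c(V1 = 0, V2 = 1.5, V3 = 6, V4 = 7, V5 = 3, V6 = 12)
obs <- as.data.frame(lapply(first_row, function(x) sample(0:2, n, replace = TRUE)))
demo <- rbind(
  data.frame(Id = "dis", as.list(first_row)),
  data.frame(Id = as.character(seq_len(n)), obs)
)

models <- fit_block_models(demo)
lapply(models, function(m) names(coef(m)))
```

With these keys the sorted order is V1, V2, V5, V3, V4, V6, so the three fits use V1 + V2 + V5, V3 + V4, and V6 respectively. Working with column names rather than positions avoids the index offset for the `Id` column.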
