I have a very large data frame and need to select groups of variables that satisfy certain criteria for analysis (for example, as the predictors in a linear model). The following small dataset illustrates my data.
set.seed(1234)
mydf <- data.frame(Id = c("dis", 1:5),
  V1.a = c(0, sample(c(0, 1, 2), 5, replace = TRUE)), V1.b = c(0, sample(c(0, 1, 2), 5, replace = TRUE)),
  V2.a = c(1.5, sample(c(0, 1, 2), 5, replace = TRUE)), V2.b = c(1.5, sample(c(0, 1, 2), 5, replace = TRUE)),
  V3.a = c(2.0, sample(c(0, 1, 2), 5, replace = TRUE)), V3.b = c(2.0, sample(c(0, 1, 2), 5, replace = TRUE)),
  V4.a = c(5.0, sample(c(0, 1, 2), 5, replace = TRUE)), V4.b = c(5.0, sample(c(0, 1, 2), 5, replace = TRUE)),
  V5.a = c(6.0, sample(c(0, 1, 2), 5, replace = TRUE)), V5.b = c(6.0, sample(c(0, 1, 2), 5, replace = TRUE)),
  V6.a = c(11.0, sample(c(0, 1, 2), 5, replace = TRUE)), V6.b = c(11.0, sample(c(0, 1, 2), 5, replace = TRUE)),
  V7.a = c(12.0, sample(c(0, 1, 2), 5, replace = TRUE)), V7.b = c(12.0, sample(c(0, 1, 2), 5, replace = TRUE)),
  V8.a = c(3.0, sample(c(0, 1, 2), 5, replace = TRUE)), V8.b = c(3.0, sample(c(0, 1, 2), 5, replace = TRUE)))
Printed data:
   Id V1.a V1.b V2.a V2.b V3.a V3.b V4.a V4.b V5.a V5.b V6.a V6.b V7.a V7.b V8.a V8.b
1 dis    0    0  1.5  1.5    2    2    6    6    7    7   11   11   12   12    3    3
2   1    1    2  1.0  1.0    2    0    0    1    2    1    2    0    0    2    0    2
3   2    1    2  2.0  0.0    2    0    2    1    2    1    0    0    2    1    1    1
4   3    2    0  1.0  2.0    1    1    1    0    2    0    1    2    0    2    1    0
5   4    0    1  1.0  1.0    1    1    0    1    0    2    2    2    2    1    0    2
6   5    1    0  2.0  2.0    0    2    1    2    1    1    1    0    2    2    2    2
Here is what I aim to do:
(1) Sort the columns, except for the Id column, by their value in the first row (the "dis" row), from smallest to largest. Only the first row is considered when sorting the data frame.
   Id V1.a V1.b V2.a V2.b V3.a V3.b V4.a V4.b V5.a V5.b V6.a V6.b V7.a V7.b V8.a V8.b
1 dis    0    0  1.5  1.5    2    2    6    6    7    7   11   11   12   12    3    3
The columns (variables) are already in order except V8.a and V8.b. The sorted data should be in this order:
   Id V1.a V1.b V2.a V2.b V3.a V3.b V8.a V8.b V4.a V4.b V5.a V5.b V6.a V6.b V7.a V7.b
1 dis    0    0  1.5  1.5    2    2    3    3    6    6    7    7   11   11   12   12
(2) Then calculate the difference between adjacent variables, using the values in the first row. For example, for V1.a and V1.b the difference is 0, and for V1.b and V2.a the difference is 1.5 - 0 = 1.5.
Differences between adjacent variables, based on row 1:
V1.b - V1.a   V2.a - V1.b   V2.b - V2.a   V3.a - V2.b   V3.b - V3.a
   0 - 0        1.5 - 0      1.5 - 1.5      2 - 1.5        2 - 2
and so on …
(3) Build the models: keep adding variables to the current model as long as the difference calculated in (2) between adjacent variables is less than 2. Once the difference reaches 2 or more, start a new model, and continue this way until the end of the data. The first row is not included in the models.
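mydf1 <- mydf[-1, ]  # drop the first ("dis") row
mydf1$Id <- as.numeric(as.character(mydf1$Id))  # Id is stored as text because of "dis"
model1 <- lm(Id ~ V1.a + V1.b + V2.a + V2.b + V3.a + V3.b + V8.a + V8.b, data = mydf1)
model2 <- lm(Id ~ V4.a + V4.b + V5.a + V5.b, data = mydf1)
model3 <- lm(Id ~ V6.a + V6.b + V7.a + V7.b, data = mydf1)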
How can I automate this process?
Edits: following the answer
> d <- mydf
> d
  Id V1.a V1.b V2.a V2.b V3.a V3.b V4.a V4.b V5.a V5.b V16a V6.b V7.a V7.b V8.a
1  0    0    0  1.5  1.5    2    2    5    5    6    6   11   11   12   12    3
2  1    1    2  1.0  0.0    1    2    1    2    0    1    1    2    0    1    1
3  2    1    1  2.0  2.0    1    2    0    1    0    1    1    1    0    2    0
4  3    0    0  1.0  2.0    2    0    2    2    2    2    2    1    1    2    0
5  4    0    2  0.0  2.0    2    2    1    1    2    2    2    1    2    1    0
6  5    2    1  2.0  0.0    2    0    1    1    0    1    0    0    0    2    1
  V8.b
1    3
2    2
3    0
4    2
5    0
6    2
> d <- d[,order(d[1,-1])]
> d
  Id V1.a V1.b V2.a V2.b V3.a V7.b V8.a V3.b V4.a V4.b V5.a V5.b V16a V6.b V7.a
1  0    0    0  1.5  1.5    2   12    3    2    5    5    6    6   11   11   12
2  1    1    2  1.0  0.0    1    1    1    2    1    2    0    1    1    2    0
3  2    1    1  2.0  2.0    1    2    0    2    0    1    0    1    1    1    0
4  3    0    0  1.0  2.0    2    2    0    0    2    2    2    2    2    1    1
5  4    0    2  0.0  2.0    2    1    0    2    1    1    2    2    2    1    2
6  5    2    1  2.0  0.0    2    2    1    0    1    1    0    1    0    0    0
The ordering is not working: V7.b (12) and V8.a (3) end up between V3.a and V3.b!
Order the columns so that the first row is in ascending order.
Compute the differences d[1,i] - d[1,i-1] (only for the first row).
Group the variables into blocks: add them one at a time, and start a new group if the difference computed at the previous step is 2 or more.
Compute the models, in a loop, creating a new data.frame each time.
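The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not necessarily the code the answer originally contained: the toy data, the numeric Id column, and the names ord, first, gaps, block_id, blocks, obs and models are all made up here. Note that the d[,order(d[1,-1])] attempt in the question misplaces columns because order() returns positions within d[1,-1] (that is, without Id), so the indices have to be shifted by one before they are used on d.

```r
# Toy data shaped like the question's: row 1 holds the "dis" values,
# rows 2-6 hold the observations. Assumption: Id is numeric.
set.seed(1234)
dis  <- c(0, 0, 1.5, 1.5, 2, 2, 5, 5, 6, 6, 11, 11, 12, 12, 3, 3)
vars <- paste0("V", rep(1:8, each = 2), ".", c("a", "b"))
d <- data.frame(Id = 0:5)
for (k in seq_along(vars)) {
  d[[vars[k]]] <- c(dis[k], sample(0:2, 5, replace = TRUE))
}

# 1. Order the columns by the first row, keeping Id in front.
#    order() returns positions within d[1, -1], so add 1 to skip Id.
ord <- order(unlist(d[1, -1]))
d   <- d[, c(1, 1 + ord)]

# 2. Differences between adjacent first-row values.
first <- unlist(d[1, -1])
gaps  <- diff(first)

# 3. Start a new block whenever the gap is 2 or more.
block_id <- cumsum(c(TRUE, gaps >= 2))
blocks   <- split(names(first), block_id)

# 4. One model per block, fitted without the first row,
#    building a new data.frame for each block.
obs    <- d[-1, ]
models <- list()
for (g in names(blocks)) {
  block_df    <- obs[, c("Id", blocks[[g]])]
  models[[g]] <- lm(Id ~ ., data = block_df)
}

print(blocks)
```

With these first-row values the blocks come out as V1.a to V3.b plus V8.a and V8.b, then V4.a to V5.b, then V6.a to V7.b. With only five observations, a block that has more predictors than rows will produce NA coefficients, so the toy models are only useful for checking the mechanics.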