Here is my problem, just hard for me…
I want to generate multiple datasets, apply a function to each of them, and collect the corresponding output in a single dataset or in multiple datasets (whichever is possible)…
Here is my example, although in practice I need to generate a large number of variables and datasets:
seed <- round(runif(10)*1000000)
datagen <- function(x){
  set.seed(x)
  var <- rep(1:3, c(rep(3, 3)))
  yvar <- rnorm(length(var), 50, 10)
  matrix <- matrix(sample(1:10, c(10*length(var)), replace = TRUE), ncol = 10)
  mydata <- data.frame(var, yvar, matrix)
}
gdt <- lapply (seed, datagen)
# resulting list (I believe is correct term) has 10 dataframes:
# gdt[1] .......to gdt[10]
# my function: this should perform an ANOVA on every component data frame and
# output the probability coefficients...
anovp <- function(x){
  ind <- 3:ncol(x)
  out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]])
  pval <- out$coefficients[,4][2]
  pval <- do.call(rbind,pval)
}
plist <- lapply (gdt, anovp)
Error in gdt[x] : invalid subscript type 'list'
This is not working. I tried different options but could not figure it out… I finally decided to bother the experts, sorry for that…
My questions are:
(1) Is it possible to handle this situation in this way, or are there other alternatives for handling multiple generated datasets?
(2) If this is the right way, how can I do it?
Thank you for your attention; I will appreciate your help…
You have the basic idea right, in that you should create a list of data frames and then use lapply to apply the function to each element of the list. Unfortunately, there are several oddities in your code.
There is no point in randomly generating a seed, then setting it. You only need to use set.seed in order to make random numbers reproducible. Cut the line that generates seed, and maybe the set.seed(x) call as well.
rep(1:3, c(rep(3, 3))) is the same as rep(1:3, each = 3).
Don't call your variables var or matrix, since they will mask the names of those functions, which is confusing.
3:ncol(x) is dangerous. If x has fewer than 3 columns, it doesn't do what you think it does.
… and now, the problem you actually wanted solving.
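Before getting to that, two of the points above are easy to check at the console. This is only a small sketch; two_cols and safe_idx are made-up names for illustration, and seq_len() is just one possible safer way to pick "column 3 onwards".

```r
# rep() with each = 3 gives the same vector as the longer times-based call
same <- identical(rep(1:3, c(rep(3, 3))), rep(1:3, each = 3))
print(same)

# With only two columns, 3:ncol(x) counts DOWN instead of being empty
two_cols <- data.frame(a = 1:2, b = 3:4)
down <- 3:ncol(two_cols)
print(down)  # the indices 3 and 2, neither of which is wanted

# One safer alternative: drop the first two indices from seq_len()
safe_idx <- seq_len(ncol(two_cols))[-(1:2)]
print(length(safe_idx))  # zero columns selected
```

With a data frame that does have more than two columns, seq_len(ncol(x))[-(1:2)] gives the same indices as 3:ncol(x).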
The problem is in the line
out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]])
lapply passes data frames into anovp, not indices, so x is a data frame in gdt[x], which throws an error.
One more thing. While you are rewriting that line, note that lm takes a data argument, so you don't need to do things like gdt$some_column; you can just reference some_column directly.
EDIT: Further advice.
You appear to always use the formula yvar ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10. Since it's the same each time, create it before your call to lapply.
I probably wouldn't bother with the anovp function. Just call lm inside lapply, passing that formula and each data frame as the data argument.
Then include a further call to lapply to play with the coefficients if necessary. Replicating your anovp code there (model$coefficients[, 4][2]) won't work, because model$coefficients is a vector (so the dimensions aren't right). Adjust to retrieve the bit you actually want.
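A minimal sketch of how this could look, under a few assumptions of mine: the names group, predictors, f, models and pvals are made up for illustration, the generator uses 30 rows instead of your 9 (with 9 rows and 10 predictors there are no residual degrees of freedom left), and I am guessing that "the bit you actually want" is the p-value for the first predictor, taken from summary() of each model.

```r
set.seed(1)  # only needed so that the example is reproducible

# Simplified generator; group and predictors replace var and matrix.
# each = 10 gives 30 rows (an assumption, the question used 9 rows).
datagen <- function() {
  group <- rep(1:3, each = 10)
  yvar <- rnorm(length(group), mean = 50, sd = 10)
  predictors <- matrix(sample(1:10, 10 * length(group), replace = TRUE), ncol = 10)
  data.frame(group, yvar, predictors)  # matrix columns are named X1..X10
}
gdt <- replicate(10, datagen(), simplify = FALSE)  # list of 10 data frames

# Create the formula once, before the call to lapply
f <- yvar ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10

# Fit one model per data frame, using the data argument of lm
models <- lapply(gdt, function(d) lm(f, data = d))

# summary(model)$coefficients is a matrix (model$coefficients is a vector);
# take the p-value for X1, i.e. row 2, column 4 as in the original anovp
pvals <- sapply(models, function(m) summary(m)$coefficients["X1", "Pr(>|t|)"])
print(pvals)
```

Here pvals is a plain numeric vector with one p-value per data frame; use lapply instead of sapply if you would rather keep a list.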