I am trying to split a data frame at a boundary value of a given variable, compute something on each side of the boundary, and return the results as a matrix (preferably a data frame). Example code below:
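set.seed(1)
tdata <- data.frame(a1=rnorm(100, mean=5, sd=2), a2=rep(0:1, length.out=100))
allErr <- sapply(1:9, function(x) {
    # split the rows into a1 <= x (TRUE) and a1 > x (FALSE)
    d <- split(tdata, tdata$a1 <= x)
    sapply(d, function (y) {
        # 1 minus the share of the most frequent a2 value in this part
        1 - max(table(y$a2)/nrow(y))
    })
})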
My result:
> allErr
           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
FALSE 0.4949495 0.4895833 0.4943820 0.4933333 0.4444444 0.4411765 0.3333333
TRUE  0.0000000 0.2500000 0.4545455 0.4800000 0.4347826 0.4696970 0.4705882
      [,8] [,9]
FALSE  0.5  0.5
TRUE   0.5  0.5
My continuous variable is tdata$a1. For each boundary value in 1:9, I want to split the data frame in two (a1 <= boundary and a1 > boundary), perform a computation on a2 for each part, and collect the results.
My question: what is the best way to do this in terms of elegance (I looked at a plyr solution but couldn’t avoid the outer sapply) and, more importantly, correct use of other R functions I might not be aware of? I’m also afraid that my solution will not scale well to data frames much larger than the ones I currently have (~10000 rows).
Nothing more elegant is springing to mind, but this modification may help your solution scale slightly better by splitting an index vector rather than the entire data frame:
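A minimal sketch of that idea, assuming the same tdata as in the question. The names allErrIdx and idx are illustrative, not from the original post:

```r
# Same toy data as in the question
set.seed(1)
tdata <- data.frame(a1 = rnorm(100, mean = 5, sd = 2),
                    a2 = rep(0:1, length.out = 100))

# allErrIdx is an illustrative name for the result matrix
allErrIdx <- sapply(1:9, function(x) {
  # Split the row indices, not the data frame itself
  idx <- split(seq_len(nrow(tdata)), tdata$a1 <= x)
  sapply(idx, function(i) {
    # 1 minus the share of the most frequent a2 value among rows i
    1 - max(table(tdata$a2[i]) / length(i))
  })
})

# One row per boundary value, if a data frame is preferred
allErrDf <- as.data.frame(t(allErrIdx))
```

Only a2 is ever subset inside the inner function, so the other columns of tdata are never copied. Note that the outer sapply only simplifies to a matrix when every boundary value leaves rows on both sides; otherwise it returns a list.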
The performance gain for this toy example is fairly small, most likely because your data frame only has two columns. If your real data frame has more columns, you’ll see more of a benefit from splitting an index vector.
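One way to check this on your own machine is to time both variants on a wider data frame. The sketch below is only an assumption about what such a comparison could look like: the 20 filler columns, the row count, and the helper names split_df and split_idx are all made up for illustration, and the timings will vary by system.

```r
set.seed(1)
n <- 10000
# Hypothetical wider data frame: a1 and a2 plus 20 filler columns
wide <- data.frame(a1 = rnorm(n, mean = 5, sd = 2),
                   a2 = rep(0:1, length.out = n),
                   matrix(rnorm(n * 20), nrow = n))

# Variant from the question: split the whole data frame
split_df <- function(d, cuts) {
  sapply(cuts, function(x) {
    sapply(split(d, d$a1 <= x),
           function(y) 1 - max(table(y$a2) / nrow(y)))
  })
}

# Variant from this answer: split only the row indices
split_idx <- function(d, cuts) {
  sapply(cuts, function(x) {
    sapply(split(seq_len(nrow(d)), d$a1 <= x),
           function(i) 1 - max(table(d$a2[i]) / length(i)))
  })
}

cuts <- 1:9
t_df  <- system.time(res_df  <- split_df(wide, cuts))
t_idx <- system.time(res_idx <- split_idx(wide, cuts))

# Both variants should agree before their timings are compared
stopifnot(isTRUE(all.equal(res_df, res_idx)))
print(c(data_frame = t_df[["elapsed"]], index = t_idx[["elapsed"]]))
```

For more stable numbers, repeat each call several times and compare the totals rather than relying on a single run.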