I have a data frame, and I’m trying to take a factor variable, keep only its 31 most frequent levels, and collapse all the other levels into a single generic level.
I need to do this across several vectors, so I figured I’d create a function, but I’m not having much luck. I think I need to somehow use mapply or Vectorize, but I don’t think I’m doing it properly, as I get error messages about being unable to allocate 3.6 GB of memory.
This is the function, where x is the vector and topCount is the number of levels to keep:
createFactor <- function(x, topCount) {
  table1 <- data.frame(table(x))
  table1 <- table1[order(-table1$Freq), ]
  noChange <- table1$Var1[1:topCount]
  newVals1 <- factor(ifelse(x %in% noChange, x, "-1000"))
  newVals1
}
I’d like to be able to write something like this:
df1$topLevels <- createFactor(df1$fact1, 31)
Any suggestions?
I’m not completely certain about the performance characteristics of this, but I probably would have written this function more like so:
where I’ve used tabulate rather than table because I suspect it may be faster (which seems important in your case), although I haven’t tested this to see how much faster it would actually be.
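For illustration, here is a minimal sketch of what a tabulate-based version could look like. It assumes x is a factor; the `other` argument and the NA handling are my own additions for the example, and this is a reconstruction under those assumptions rather than a tested or benchmarked implementation. Note that it converts x to character before replacing values, since calling ifelse on a factor returns the underlying integer codes rather than the level labels.

```r
# Sketch only: keep the topCount most frequent levels of a factor and
# collapse every other level into one generic level.
# `other` is a hypothetical argument added for this example.
createFactor <- function(x, topCount, other = "-1000") {
  stopifnot(is.factor(x))
  # Count occurrences of each integer code; nbins gives one bin per level,
  # including levels that never occur
  counts <- tabulate(as.integer(x), nbins = nlevels(x))
  names(counts) <- levels(x)
  # Labels of the most frequent levels (at most topCount of them)
  nKeep <- min(topCount, length(counts))
  keep <- names(sort(counts, decreasing = TRUE))[seq_len(nKeep)]
  # Work on the labels, not the integer codes; leave NAs untouched
  out <- as.character(x)
  out[!is.na(out) & !(out %in% keep)] <- other
  factor(out)
}

# Small made-up data frame to show the call pattern
df1 <- data.frame(
  fact1 = factor(c("a", "a", "a", "b", "b", "c", "d")),
  fact2 = factor(c("x", "y", "y", "y", "z", "z", "w"))
)
df1$topLevels <- createFactor(df1$fact1, 2)

# Applying it to several factor columns at once with lapply
cols <- c("fact1", "fact2")
df1[paste0(cols, "_top")] <- lapply(df1[cols], createFactor, topCount = 2)
```

Because the function takes a whole vector and returns a whole vector, there should be no need for mapply or Vectorize; lapply over the columns is enough.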