I’ve created an R script that calculates the percentage of missing values in each column of a data frame, and then removes the columns that exceed a preset threshold. The column names need to be maintained.
The names are maintained when there is more than one column in the data frame after column deletion, but not when there is only one column.
Code for the case where the column names are kept:
df <- data.frame(A=rnorm(10, 10, 1), B=rep(NA, 10), C=rnorm(10, 10, 1))
threshold <- 80
pmiss <- function(x) {
ifelse(sum(is.na(x))/length(x)*100 > threshold, TRUE, FALSE)
}
temp <- sapply(df, pmiss)
deletecols <- names(temp[temp==TRUE])
df <- as.data.frame(df[,!(names(df) %in% deletecols)])
names(df) #prints
[1] "A" "C"
However, define df as
df <- data.frame(A=rnorm(10, 10, 1), B=rep(NA, 10))
rerun the rest of the code, and
names(df) #prints
[1] "df[, !(names(df) %in% deletecols)]"
Does anybody know why the column names are not kept when there is only one column?
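You've been bitten by an R FAQ. When only one column is selected, df[, j] drops the result to a plain vector, and as.data.frame then names the new column after the expression it was given. Add
, drop = FALSE to your data frame subsetting (and you'll notice, as a side effect, that you no longer need as.data.frame).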
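As a minimal sketch of that fix, here is the two-column example from the question with drop = FALSE added. The set.seed call and the shortened pmiss are my own additions for illustration, not part of the original script; mean(is.na(x)) * 100 computes the same percentage as sum(is.na(x))/length(x)*100.

```r
set.seed(1)  # assumption: fixed seed, only so the example is reproducible

df <- data.frame(A = rnorm(10, 10, 1), B = rep(NA, 10))
threshold <- 80

# TRUE when the percentage of missing values in x exceeds the threshold
pmiss <- function(x) mean(is.na(x)) * 100 > threshold

# Names of the columns to delete (here only "B")
deletecols <- names(df)[sapply(df, pmiss)]

# drop = FALSE keeps the result a data frame even when one column remains,
# so the column name survives and as.data.frame() is not needed
df <- df[, !(names(df) %in% deletecols), drop = FALSE]

names(df)
# [1] "A"
```

Without drop = FALSE, the same subsetting would return a numeric vector of length 10, which is where the column name gets lost.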