I have a large data frame. I need to do the following:
- Select one column of the data frame.
- Iterate over all rows of the data frame, ignoring the selected column.
- Find all rows that are equal in every element except the selected column.
- Group them so that the group name is the row index and the group values are the indexes of the duplicated rows.
I wrote a function for this task, but it is slow because of a nested loop. I would like some ideas on how this code can be improved.
Say we have a dataframe like this:
  V1 V2 V3 V4
1  1  2  1  2
2  1  2  2  1
3  1  1  1  2
4  1  1  2  1
5  2  2  1  2
And we want to get this list as the output:
diff.dataframe("V2", conf.new, conf.new)
Output:
$`1`
[1] 1
$`2`
[1] 2
$`3`
[1] 1 3
$`4`
[1] 2 4
$`5`
[1] 5
The following code reaches the goal, but it is too slow. Is it possible to improve it somehow?
diff.dataframe <- function(param, df1, df2){
  excl.names <- c(param)
  # Drop the selected column(s) and convert the remaining values to character
  df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names, drop=FALSE], as.character), stringsAsFactors=FALSE)
  df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names, drop=FALSE], as.character), stringsAsFactors=FALSE)
  list.out <- list()
  # Compare every row of df1 with every row of df2
  for (i in seq_len(nrow(df1.excl))){
    for (j in seq_len(nrow(df2.excl))){
      if (paste(df1.excl[i,], collapse='') == paste(df2.excl[j,], collapse='')){
        # Skip row i if it has already been recorded as a member of a group
        if (!as.character(i) %in% unlist(list.out)){
          list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
        }
      }
    }
  }
  return(list.out)
}
Let’s generate some data first
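The data-generating code is not shown here, so the following is only a stand-in: it rebuilds the five-row example from the question under the name df, which is the name the explanation below uses.

```r
# Stand-in data: the five-row example from the question, stored as `df`
df <- data.frame(
  V1 = c(1, 1, 1, 1, 2),
  V2 = c(2, 2, 1, 1, 2),
  V3 = c(1, 2, 1, 2, 1),
  V4 = c(2, 1, 2, 1, 2)
)
print(df)
```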
We then loop through the rows with lapply. Each row i is then compared to all rows of df with apply (including itself). Rows with <= 1 differences return TRUE and the others return FALSE, producing a logical vector, which we convert to a numeric vector with which.
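A minimal sketch of the approach just described, reconstructed from the description rather than taken from the original code. The names m and near.dups are assumptions made for illustration, and df is rebuilt from the question's example so the snippet runs on its own.

```r
# Example data from the question (assumed to be what `df` refers to)
df <- data.frame(
  V1 = c(1, 1, 1, 1, 2),
  V2 = c(2, 2, 1, 1, 2),
  V3 = c(1, 2, 1, 2, 1),
  V4 = c(2, 1, 2, 1, 2)
)

# Work on a matrix so that each row is a plain vector
m <- as.matrix(df)

# Outer loop over rows with lapply; inner comparison against all rows with apply
near.dups <- lapply(seq_len(nrow(m)), function(i) {
  # TRUE for every row that differs from row i in at most one column
  close <- apply(m, 1, function(r) sum(r != m[i, ]) <= 1)
  # which() turns the logical vector into row indexes
  unname(which(close))
})
names(near.dups) <- rownames(df)

print(near.dups)
```

Note that this treats any single differing column as acceptable, so its result is not the same as the output of diff.dataframe("V2", ...) in the question. To allow a difference only in one chosen column, one option would be to drop that column from m before the comparison and test for zero differences instead.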