I have a question regarding removing duplicates after sorting within a tuple in R.
Let’s say I have a matrix of values (built with cbind, so not strictly a data frame):
df<-cbind(c(1,2,7,8,5,1),c(5,6,3,4,1,8),c(1.2,1,-.5,5,1.2,1))
From it I take the first two columns, a and b:
a=df[,1]
b=df[,2]
temp<-cbind(a,b)
What I am doing is removing duplicates based on a sorted tuple. For example, I want to keep a=1,2,7,8,1 and b=5,6,3,4,8, with the entries a[5] and b[5] removed. This is for determining interactions between two objects: 1 vs 5, 2 vs 6, etc. Since 5 vs 1 is the same as 1 vs 5, I want to remove it.
The route I started to take was as follows. I created a function that sorts each row, and I collected the results back into a matrix:
sortme<-function(i){sort(temp[i,])}
sorted<-t(sapply(1:nrow(temp),sortme))
and got the following result:
a b
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
[5,] 1 5
[6,] 1 8
I then call unique on the sorted result:
unique(sorted)
which gives
a b
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
[5,] 1 8
I also use !duplicated to get a logical (TRUE/FALSE) vector that I can use to index my original dataset, so that I can pull out the matching values from the separate third column.
T_F<-!duplicated(sorted)
final_df<-df[T_F,]
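What I want to know is whether I’m going about this the right way for a very large dataset, or whether there is a built-in function that does this already.
Depending on what you mean by “a very large dataset”, you might gain some speed by applying the sorting function only to those rows whose sums are duplicated. Two rows can only contain the same pair of values if their sums are equal, so any row with a unique sum cannot be a duplicate and does not need to be sorted.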
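As a rough sketch of that idea, using the example data from the question: rows whose sum occurs only once are left alone, and only the remaining candidate rows are sorted before calling duplicated. The names s, candidate and keep are made up for illustration, and whether this is actually faster will depend on how many rows share a sum in your real data, so it is worth timing on a sample first.

```r
# Example data from the question (a numeric matrix)
df <- cbind(c(1, 2, 7, 8, 5, 1), c(5, 6, 3, 4, 1, 8), c(1.2, 1, -0.5, 5, 1.2, 1))
temp <- df[, 1:2]

# Rows holding the same unordered pair must have equal sums,
# so only rows whose sum appears more than once can be duplicates.
s <- rowSums(temp)
candidate <- duplicated(s) | duplicated(s, fromLast = TRUE)

# Sort only the candidate rows; rows with a unique sum are left as they are.
sorted <- temp
if (any(candidate)) {
  sorted[candidate, ] <- t(apply(temp[candidate, , drop = FALSE], 1, sort))
}

# Logical index of rows to keep, applied to the original data
keep <- !duplicated(sorted)
final_df <- df[keep, ]
```

On the example data only rows 1 and 5 share a sum (6), so only those two rows are sorted, and row 5 is the one dropped, as in the original approach.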