I’m migrating from data frames and matrices to data tables, but haven’t found a solution for extracting the unique rows from a data table. I presume there’s something I’m missing about the [,J] notation, though I’ve not yet found an answer in the FAQ and intro vignettes. How can I extract the unique rows, without converting back to data frames?
Here is an example:
library(data.table)
set.seed(123)
a <- matrix(sample(2, 120, replace = TRUE), ncol = 3)
a <- as.data.frame(a)
b <- as.data.table(a)
# Confirm dimensionality
dim(a) # 40 3
dim(b) # 40 3
# Unique rows using all columns
dim(unique(a)) # 8 3
dim(unique(b)) # 34 3
# Unique rows using only a subset of columns
dim(unique(a[,c("V1","V2")])) # 4 2
dim(unique(b[,list(V1,V2)])) # 29 2
Related question: Is this behavior a result of the data being unsorted, as with the Unix uniq function?
Before data.table v1.9.8, the default behavior of the unique.data.table method was to use the key to determine the columns by which the unique combinations should be returned. If the key was NULL (the default), one would get the original data set back (as in the OP's situation).
As of data.table 1.9.8+, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) into unique (replacing DT in the call to key with the name of the data.table).
Hence, the old behavior would be something like:
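A minimal sketch, assuming the same data frame a as in the question: keying the table on every column makes the key-based unique compare whole rows.

```r
library(data.table)
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))

# Key on every column so that the key-based unique() considers all of them
b <- data.table(a, key = names(a))
key(b)          # "V1" "V2" "V3"
dim(unique(b))  # same dimensions as dim(unique(a))
```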
While for data.table v1.9.8+, just
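As a sketch under the same assumptions (the data from the question, data.table 1.9.8 or later), no key is needed. The by = key(b) call is only there to illustrate opting back into key columns; keying on V1 and V2 is an arbitrary choice for the example.

```r
library(data.table)
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))

b <- as.data.table(a)        # no key set
dim(unique(b))               # all columns are compared by default

# If a key exists and should drive the comparison, pass it explicitly
setkey(b, V1, V2)
dim(unique(b, by = key(b)))  # one row per distinct (V1, V2) pair
```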
Or without a copy
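A sketch of the by-reference variant, again assuming the question's data: setDT converts the existing data frame in place instead of building a second object. The helper variable n_df is only there for comparison.

```r
library(data.table)
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))

n_df <- nrow(unique(a))  # row count from unique.data.frame, for comparison
setDT(a)                 # turns a into a data.table by reference, no copy
dim(unique(a))           # now dispatches to unique.data.table
```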