Edit 2019: This question was asked before the changes to data.table in November 2016; see the accepted answer below for both the current and previous methods.
I have a data.table table with about 2.5 million rows. There are two columns. I want to remove any rows that are duplicated in both columns. Previously for a data.frame I would have done this:
df <- unique(df[,c('V1', 'V2')]) but this doesn’t work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE]) but it still seems to operate only on the key of the data.table and not the whole row.
Any suggestions?
Cheers,
Davy
Example
>dt
V1 V2
[1,] A B
[2,] A C
[3,] A D
[4,] A B
[5,] B A
[6,] C D
[7,] C D
[8,] E F
[9,] G G
[10,] A B
In the above data.table, where V2 is the table key, only rows 4, 7, and 10 would be removed.
dt <- data.table::data.table(
  V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
  V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G")
)
For v1.9.8+ (released November 2016)
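A minimal sketch of the two v1.9.8+ calls this section covers, assuming the example data from the question (the names all_cols and by_v2 are just for illustration):

```r
library(data.table)

# Example data from the question (already sorted by V2).
dt <- data.table(
  V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
  V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G")
)

# Default in v1.9.8+: rows are compared on all columns.
all_cols <- unique(dt)

# `by` restricts the comparison to the named columns;
# the first row of each V2 group is kept.
by_v2 <- unique(dt, by = "V2")

nrow(all_cols)  # 7
nrow(by_v2)     # 6
```

On this data the all-column call drops only the three fully duplicated rows, while by = "V2" keeps one row per distinct V2 value.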
From ?unique.data.table: by default all columns are used (which is consistent with ?unique.data.frame).
Or use the by argument to get unique combinations of specific columns (as keys were previously used for).
Prior to v1.9.8
From ?unique.data.table, it is clear that calling unique on a data table only works on the key. This means you have to reset the key to all columns before calling unique.
Calling unique with one column as key:
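On a current data.table, plain unique(dt) no longer looks at the key, so the sketch below passes by = key(dt) explicitly to reproduce the key-based behaviour described here. It assumes the question's example data; keyed and full are illustrative names only.

```r
library(data.table)

dt <- data.table(
  V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
  V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G")
)

# One column as key: de-duplication only looks at V2.
setkey(dt, V2)
keyed <- unique(dt, by = key(dt))

# Reset the key to all columns, as the old workaround required,
# so that whole rows are compared.
setkeyv(dt, names(dt))
full <- unique(dt, by = key(dt))

nrow(keyed)  # 6
nrow(full)   # 7
```

With V2 alone as the key, rows such as A C and A D survive but C D is dropped because D was already seen; with both columns in the key, only the exact duplicate rows go.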