How do I select all the rows that have a missing value in the primary key of a data.table?
DT = data.table(x=rep(c("a","b",NA),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)
Selecting for a particular value is easy:
DT["a",]
Selecting for the missing values seems to require a vector search. One cannot use binary search. Am I correct?
DT[NA,]        # does not work
DT[is.na(x),]  # does work
Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice this may not really matter much.
Addition from Matthew (won’t fit in comment):
The data above has 3 very large groups, though. So the speed advantage of binary search is dominated here by the time to create the large subset (1/3 of the data is copied).
As the size of the subset returned decreases (e.g. by adding more groups), the difference becomes apparent. Vector scans on a single column aren’t too bad, but on 2 or more columns performance quickly degrades.
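A rough way to check this yourself is a timing sketch like the one below. The table size, the group labels (g00001 and so on) and the variable names are made up for illustration. Timings depend on the machine and the data.table version, so no expected numbers are shown.

```r
library(data.table)

set.seed(1)
n <- 1e6L

# Hypothetical data: about 10,000 small groups plus some missing keys
keys <- c(sprintf("g%05d", 1:10000), NA_character_)
DT <- data.table(x = sample(keys, n, replace = TRUE), v = seq_len(n))
setkey(DT, x)

# Binary search on the key for a single small group
t_key <- system.time(res_key <- DT["g00001"])

# Vector scan over the whole column for the missing keys
t_na <- system.time(res_na <- DT[is.na(x)])

# Elapsed seconds for each approach
print(c(keyed = t_key[["elapsed"]], scan = t_na[["elapsed"]]))
```

With subsets this small, copying the result is cheap, so whatever gap remains between the two calls comes mostly from the lookup itself.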
Maybe NAs should be joinable too. I seem to remember a gotcha with that, though. Here’s some history linked from FR#1043, "Allow or disallow NA in keys?". It mentions there that NA_integer_ is internally a negative integer. That trips up radix/counting sort (iirc), resulting in setkey going slower. But it’s on the list to revisit.
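The "negative integer" point can be seen from base R alone. This is only a small illustration of the storage format, not of how data.table sorts keys: writing NA_integer_ out as raw bytes shows the bit pattern of the smallest 32-bit signed integer.

```r
# Bytes of NA_integer_ in big-endian order
na_bytes <- writeBin(NA_integer_, raw(), endian = "big")
print(na_bytes)

# For comparison, the largest representable integer
max_bytes <- writeBin(.Machine$integer.max, raw(), endian = "big")
print(max_bytes)
```

The first result is 80 00 00 00, i.e. the sign bit set and everything else zero, so a sort that works on the raw integer values would have to treat it as a special case.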