I am trying to understand the join logic in data.table from the documentation and am a bit unclear. I know I can just try this and see what happens, but I want to make sure there is no pathological case, so I would like to know how the logic is actually coded. When two data.table objects have a different number of key columns, for example a has 2 and b has 3, and you run c <- a[b], will a and b be merged only on the first two key columns, or will the third column in a automatically be merged to the third key column in b? An example:
require(data.table)
a <- data.table(id=1:10, t=1:20, v=1:40, key=c("id", "t"))
b <- data.table(id=1:10, v2=1:20, key="id")
c <- a[b]
This should select rows of a that match the id key column in b. For example, for id==1 in b, there are 2 rows in b and 4 rows in a that should generate 8 rows in c. This is indeed what seems to happen:
> head(c,10)
id t v v2
1: 1 1 1 1
2: 1 1 21 1
3: 1 11 11 1
4: 1 11 31 1
5: 1 1 1 11
6: 1 1 21 11
7: 1 11 11 11
8: 1 11 31 11
9: 2 2 2 2
10: 2 2 22 2
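The row counts described above can be checked directly. This is a minimal sketch, not taken from the data.table sources: it assumes a recent data.table release, writes the recycling out with rep(), and passes allow.cartesian = TRUE because the 80-row result is larger than nrow(a) + nrow(b). The name c1 is only illustrative.

```r
library(data.table)

# Same data as in the example, with the recycling made explicit.
a <- data.table(id = rep(1:10, 4), t = rep(1:20, 2), v = 1:40, key = c("id", "t"))
b <- data.table(id = rep(1:10, 2), v2 = 1:20, key = "id")

# b has one key column, so only a's first key column (id) takes part in the match.
c1 <- a[b, allow.cartesian = TRUE]

nrow(c1)          # 80: each of the 10 ids has 4 rows in a and 2 rows in b
c1[id == 1, .N]   # 8 rows for id == 1
```

If t were used for matching, the per-id count could not be 4 x 2, since b has no t column to match against.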
The other way to try it is:
d <- b[a]
This should do the same thing. For every row in a, it should select the matching rows in b. Since a has an extra key column, t, that column should not be used for matching, and the join should be based only on the first key column, id. This seems to be the case:
> head(d,10)
id v2 t v
1: 1 1 1 1
2: 1 11 1 1
3: 1 1 1 21
4: 1 11 1 21
5: 1 1 11 11
6: 1 11 11 11
7: 1 1 11 31
8: 1 11 11 31
9: 2 2 2 2
10: 2 12 2 2
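One way to confirm that both directions produce the same set of rows is to sort the two results identically and compare them. Again this is only a sketch under the same assumptions as the question's example (recent data.table, allow.cartesian = TRUE needed for the 80-row result); c1 and d1 are illustrative names.

```r
library(data.table)

a <- data.table(id = rep(1:10, 4), t = rep(1:20, 2), v = 1:40, key = c("id", "t"))
b <- data.table(id = rep(1:10, 2), v2 = 1:20, key = "id")

c1 <- a[b, allow.cartesian = TRUE]   # columns: id, t, v, v2
d1 <- b[a, allow.cartesian = TRUE]   # columns: id, v2, t, v

# Put the columns in the same order, then sort both tables by every column.
setcolorder(d1, names(c1))
setkeyv(c1, names(c1))
setkeyv(d1, names(d1))

# TRUE when both joins contain exactly the same rows.
same_rows <- isTRUE(all.equal(c1, d1, check.attributes = FALSE))
same_rows
```

The two results differ only in row and column order, which is what you would expect if both joins match on id alone.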
Can someone confirm? To be clear: is the extra key column (t in a here) ever used in either join, or does data.table only use the first min(length(key(a)), length(key(b))) key columns of the two tables?
Good question. First, the correct terminology is (from ?data.table):
So “key” (singular), not “keys” (plural). We can get away with “keys” for now, but when secondary keys are added in the future, there may be multiple keys. Each key (singular) can have multiple columns (plural).
Otherwise you’re absolutely correct. The following paragraph was improved in v1.8.2 based on feedback from others who were also confused. From ?data.table:
Following comments, in v1.8.3 (on R-Forge) this now reads (changes in bold):