I usually work with big dataframes that are pretty well sorted (or can be easily sorted).
Given two dataframes, both sorted by ‘user’:
some.data <user> <data_1> <data_2>
user <user> <user_attr_1> <user_attr_2>
When I run m = merge(some.data, user), I get the result:
m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>
That is exactly what I want.
But merge doesn’t take advantage of the dataframes already being sorted on the common column, which makes the merge pretty CPU- and memory-heavy. Since both inputs are sorted, the merge could be done in O(n) with a single pass over each table.
Is there a way in R to do an efficient merge on sorted datasets?
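To make the O(n) idea concrete, here is a minimal sketch of a merge join in base R. It is not how merge works internally; sorted_merge is a hypothetical helper, and it assumes both inputs are sorted ascending by the key and that each user appears at most once in the user table. An R-level loop like this will be slow in practice, so read it as an illustration of the algorithm only.

```r
# Hypothetical helper: inner join of two data frames already sorted by `key`.
# Assumes keys in `right` are unique; `left` may repeat a key.
sorted_merge <- function(left, right, key = "user") {
  lk <- left[[key]]
  rk <- right[[key]]
  match_idx <- integer(length(lk))  # matching row of `right` per row of `left` (0 = none)
  j <- 1L
  for (i in seq_along(lk)) {
    # Advance the right pointer while its key is smaller; it never moves back,
    # so the total work is O(nrow(left) + nrow(right)).
    while (j <= length(rk) && rk[j] < lk[i]) j <- j + 1L
    if (j <= length(rk) && rk[j] == lk[i]) match_idx[i] <- j
  }
  keep <- match_idx > 0L
  out <- cbind(
    left[keep, , drop = FALSE],
    right[match_idx[keep], setdiff(names(right), key), drop = FALSE]
  )
  rownames(out) <- NULL
  out
}

# Made-up example data, both tables sorted by `user`
some.data <- data.frame(user = c(1L, 1L, 2L, 4L), data_1 = c(10, 11, 20, 40))
user <- data.frame(user = c(1L, 2L, 3L), user_attr_1 = c("a", "b", "c"))
m <- sorted_merge(some.data, user)
```

Users 3 and 4 appear in only one table, so they are dropped, matching the default inner-join behaviour of merge.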
I don’t have any experience with it, but as far as I know, this is one of the issues that package data.table was designed to improve.
For most practical purposes, data.table = data.frame + index. As a consequence, when used right, this improves performance of quite a few large operations.
There is a danger that turning your data.frame into a data.table (i.e. adding the index) could take some time (although I expect this to be well optimized), but once you’ve got it up, functions like merge can easily use the index for better performance.
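As a rough sketch of what that could look like, assuming the data.table package is installed: setDT converts an existing data.frame in place, setkey sorts the table by the given column and marks it as keyed, and merge then joins on the shared key. The example data is made up, and the user table is called users here only to avoid confusing it with the user column.

```r
library(data.table)

# Made-up example data
some.data <- data.frame(user = c(1L, 1L, 2L, 4L), data_1 = c(10, 11, 20, 40))
users <- data.frame(user = c(1L, 2L, 3L), user_attr_1 = c("a", "b", "c"))

# Convert to data.table by reference (no copy of the data is made)
setDT(some.data)
setDT(users)

# Sort each table by `user` and record it as the key
setkey(some.data, user)
setkey(users, user)

# With both tables keyed, merge() joins on the shared key column
m <- merge(some.data, users)
```

Whether this is actually faster than the base merge on your data is something to measure, for example with system.time, rather than take on faith.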