With the recent introduction of the dataframe package, I thought it was time to properly benchmark the various data structures and highlight what each is best at. I’m no expert on their respective strengths, so my question is: how should we go about benchmarking them?
Some (rather crude) things I have tried:
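library(microbenchmark)
library(data.table)
mat <- matrix(rnorm(10000), nrow = 100)  # 100 x 100 numeric matrix
mat2df.base <- data.frame(mat)           # built before dataframe is loaded
library(dataframe)                       # intended to switch data.frame() to the package's version
mat2df.dataframe <- data.frame(mat)
mat2dt <- data.table(mat)
# Time transposing each structure, 1000 repetitions per expression
bm <- microbenchmark(t(mat), t(mat2df.base), t(mat2df.dataframe), t(mat2dt), times = 1000)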
Results:
Unit: microseconds
                 expr      min       lq   median       uq       max
1              t(mat)   20.927   23.210   31.201   36.908   951.591
2      t(mat2df.base)  929.903  974.039  997.439 1040.814 28270.717
3 t(mat2df.dataframe)  924.957  969.093  992.683 1025.404 27255.205
4           t(mat2dt) 1749.465 1817.382 1857.903 1909.649  5347.321
I’m no data.table expert, but from what I understand, its primary advantage is in indexing, so try subsetting with each structure to compare speeds.
With a sufficiently large matrix, data.table wipes the floor with the other options.
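A minimal sketch of what such a subsetting comparison could look like, using only base R, data.table and microbenchmark. The data size `n`, the column names `id` and `value`, the lookup key `target` and the name `bm.lookup` are all made up for illustration; this is not the code that produced any timings quoted in this thread.

```r
library(microbenchmark)
library(data.table)

set.seed(1)                        # reproducible data
n <- 1e5                           # hypothetical size; raise it to widen the gap
df <- data.frame(id = sample(n), value = rnorm(n))
dt <- as.data.table(df)
setkey(dt, id)                     # sorts by id so lookups can use binary search

target <- 4242L                    # hypothetical key to look up

bm.lookup <- microbenchmark(
  base_df  = df[df$id == target, ],  # scans the whole id column
  dt_expr  = dt[id == target],       # same expression, data.table syntax
  dt_keyed = dt[.(target)],          # lookup through the key set above
  times = 100
)
print(bm.lookup)
```

Setting the key is a one-off cost that is not included in the timings, so whether it pays off depends on how many lookups follow it.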
Also, I suspect that @RJ-’s attempt to compare the performance of base data.frame with the dataframe package’s data.frames is not working. The performances are just too similar, and I suspect the results are those of the loaded library, not of base.
Edit: Tested. Doesn’t seem to make much of a difference. bm.after is the same code as bm.subset above, just run at the same time as bm.before to provide an accurate comparison.
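For anyone wanting to rule out masking themselves, here is a rough sketch of how to see which attached package a function name currently resolves to, using only base R. It only detects masking through the search path; if a package changed behaviour some other way, this check would not show it, and nothing here is specific to the dataframe package.

```r
# Which attached package does each name resolve to right now?
find("data.frame")    # "package:base" in a fresh session with nothing masking it
find("as.matrix")

# Name of the namespace that owns the function a bare call would use
environmentName(environment(data.frame))

# Force the base constructor regardless of what else is attached
mat <- matrix(rnorm(10000), nrow = 100)
mat2df.base <- base::data.frame(mat)
```

Running the `find()` calls once before and once after `library(...)` shows whether the second benchmark is really exercising a different function.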