I’ve got a dataframe dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this dataframe, e.g.,
rows <- list(c("34", "36", "39"), c("45", "46"))
This says that dataframe rows with rownames (not numeric row indices, but character rownames(dat)) “34”, “36”, “39” constitute one grouping, and “45”, “46” constitute another grouping.
Now I want to pull out the groupings from the dataframe into a parallel list, but my code (below) is really, really slow. How can I speed it up?
> system.time(lapply(rows, function(r) {dat[r, ]}))
user system elapsed
246.09 0.01 247.23
That’s on a very fast computer, R 2.14.1 x64.
One of the main issues is the matching of row names: the default in [.data.frame is partial matching of row names, and you probably don’t want that, so you’re better off with match. To speed it up even further you can use fmatch from the fastmatch package if you want. This is a minor modification with some speedup.
You can get a further speedup by not using [ (it is slow for data frames) but splitting the data frame (using split) if your rows are non-overlapping and cover all rows (and thus you can map each row to one entry in rows).
Depending on your actual data you may be better off with matrices, which have much faster subsetting operators since they are native.
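A minimal sketch of the three suggestions (match, split, matrices) on a small synthetic data frame. The toy data and the names res_match, res_split and res_mat are illustrative assumptions, not from the original post. fmatch appears only in a comment because fastmatch needs a separate install. The split variant here marks rows outside every group as NA so that split drops them, and it assumes the groups do not overlap.

```r
# Hypothetical stand-in for the 30000 x 50 data frame, with character row names.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(100 * 5), nrow = 100))
rownames(dat) <- as.character(seq_len(nrow(dat)))
rows <- list(c("34", "36", "39"), c("45", "46"))

# Baseline from the question: character indexing, which goes through
# partial matching of row names inside `[.data.frame`.
res_naive <- lapply(rows, function(r) dat[r, ])

# Exact matching with match(), then integer indexing.
rn <- rownames(dat)
res_match <- lapply(rows, function(r) dat[match(r, rn), ])

# With the fastmatch package installed, fmatch() can replace match():
# res_fmatch <- lapply(rows, function(r) dat[fastmatch::fmatch(r, rn), ])

# split(): map every row to its group, NA for rows that belong to no group.
# Assumes no row name appears in more than one group.
grp <- rep(NA_integer_, nrow(dat))
grp[match(unlist(rows), rn)] <- rep(seq_along(rows), sapply(rows, length))
res_split <- split(dat, grp)  # rows with NA group are dropped

# Matrix variant: only sensible when all columns share one type.
m <- as.matrix(dat, rownames.force = TRUE)
res_mat <- lapply(rows, function(r) m[r, , drop = FALSE])
```

Wrapping each lapply call in system.time() on the real data would show which variant pays off. Note that split returns the rows of each group in data frame order rather than in the order listed in rows.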