I wish to filter a dataframe, original.data, in R. The dataframe could have around 1-2 million observations. The dataframe has several fields and the names may vary. The user can select which fields to filter by. These field names are stored in names(all.filters), where all.filters is a list of variable length. The user may then choose the levels for each of the fields in names(all.filters). For example, this list may look something like:
> all.filters
$Period
[1] "2010-12-31" "2011-03-31" "2011-06-30" "2011-09-30" "2011-12-31"
[6] "2012-03-31" "2012-06-30" "2012-09-30"
$Size
[1] "L" "VL"
$Number
[1] "11" "21" "35" "42" "45" "47" "49" "52" "57"
I am using the following code to apply the chosen filters:
attach(original.data)
filter.names <- names(all.filters)
flag <- 1
for (filter in filter.names) {
  flag <- flag * is.element(get(filter), all.filters[[filter]])
}
filtered.data <- original.data[flag == 1, ]
This works, but it feels a little slow. Note that get(filter) retrieves the column of original.data with column name equal to filter. I’m not sure if this is a good way to filter the data, but the variable nature of all.filters limits my choices a bit – I wanted to use subset, but I’m not sure what to put as the select argument. I would like to make this filtering step as fast as possible so that when the user updates a filter selection, the data can be plotted quickly.
Once the data is filtered, I use reshape2 to summarise the data before plotting it with ggplot2. I am thinking that it may be more efficient to apply the filters at one of these steps if possible.
Any suggestions would be greatly appreciated.
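For reference, the loop in the question can be sketched in base R without attach() or get(), by looking columns up by name and combining one logical mask per filter. The toy data below is made up for illustration (the Value column and the specific rows are assumptions), and this is not a benchmarked replacement, only a hedged restatement of the same logic:

```r
# Hypothetical toy data standing in for original.data
original.data <- data.frame(
  Period = c("2010-12-31", "2011-03-31", "2010-12-31", "2013-01-31"),
  Size   = c("L", "VL", "S", "L"),
  Number = c("11", "21", "11", "11"),
  Value  = c(1.5, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)
all.filters <- list(
  Period = c("2010-12-31", "2011-03-31"),
  Size   = c("L", "VL"),
  Number = c("11", "21")
)

# One logical vector per filtered column, looked up by name with [[ ]]
# instead of attach() and get()
masks <- lapply(names(all.filters), function(f) {
  original.data[[f]] %in% all.filters[[f]]
})

# Combine the masks with a logical AND; the initial TRUE keeps every row
# when all.filters is empty
keep <- Reduce(`&`, masks, TRUE)

filtered.data <- original.data[keep, , drop = FALSE]
```

Because the number and names of the filters come from names(all.filters), this handles a variable-length filter list in the same way as the original loop.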
You can use a data.table with appropriately set keys. This will be memory efficient.
Then you can pass your list of filters to the i component of [.data.table.
The only issue I can see is that you will have to reset the key each time to ensure you are referencing the correct columns. Also, you will need to ensure that the filter identifiers are the same class as the columns in the data.frame; it may be easier to work with character, not factor, columns.
EDIT
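As a minimal sketch of the keyed lookup described above, assuming a single selected level per column: the table DT, its Value column, and one.filter are hypothetical names invented for the example, and the list is converted with as.data.table() here only so that the keyed join is explicit.

```r
library(data.table)

# Hypothetical toy data with character (not factor) columns
DT <- data.table(
  Period = c("2010-12-31", "2011-03-31", "2010-12-31", "2011-03-31"),
  Size   = c("L", "VL", "VL", "L"),
  Value  = c(1, 2, 3, 4)
)

# Hypothetical filter with one level per column
one.filter <- list(Period = "2010-12-31", Size = "VL")

# Reset the key so it matches the filtered columns, in the same order
# as the filter list
setkeyv(DT, names(one.filter))

# Join the filter values against the key columns via i;
# nomatch = 0L drops filter rows that have no match in DT
filtered <- DT[as.data.table(one.filter), nomatch = 0L]
```

The filter values are character strings, matching the class of the key columns, which is the class agreement mentioned above.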
To filter with more than one level on more than one column, use CJ. CJ is a cross join (the data.table equivalent of expand.grid, with keys set).
You could also do
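Returning to the CJ filter described above, here is a minimal sketch, assuming character columns and a filter list whose names match column names. The toy table and its Value column are assumptions made for illustration, and this is not a reconstruction of any other alternative:

```r
library(data.table)

# Hypothetical toy data
DT <- data.table(
  Period = c("2010-12-31", "2011-03-31", "2011-06-30", "2011-03-31"),
  Size   = c("L", "VL", "L", "S"),
  Value  = c(1, 2, 3, 4)
)
all.filters <- list(
  Period = c("2010-12-31", "2011-03-31"),
  Size   = c("L", "VL")
)

# Key the table on the filtered columns, in the same order as the list
setkeyv(DT, names(all.filters))

# CJ builds every combination of the selected levels (2 x 2 = 4 rows here)
combos <- do.call(CJ, all.filters)

# Join the combinations against the key; nomatch = 0L drops combinations
# that do not occur in DT
filtered.data <- DT[combos, nomatch = 0L]
```

The number of rows in combos is the product of the number of selected levels per column, so it can grow quickly when many levels are selected across many columns.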