I have a vector of values, call it X, and a data frame, call it dat.fram. I want to run something like “grep” or “which” to find all the indices of dat.fram[,3] which match each of the elements of X.
Below is the very inefficient for loop I currently have. Note that X has many elements, and each member of “match.ind” can have zero or more matches. Also, dat.fram has over 1 million observations. Is there a vectorized function in R that would make this process more efficient?
Ultimately, I need a list since I will pass the list to another function that will retrieve the appropriate values from dat.fram .
Code:
match.ind=list()
for(i in 1:150000){
match.ind[[i]]=which(dat.fram[,3]==X[i])
}
UPDATE:
Ok, wow, I just found an awesome way of doing this… it’s really slick. Wondering if it’s useful in other contexts…?!
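A minimal sketch of one way to do this, assuming v holds the column dat.fram[,3] and grouping the matching positions with split(). The toy values of v and X below are made up for illustration:

```r
# Toy data (hypothetical): v stands in for dat.fram[, 3]
v <- c(4, 2, 3, 4, 1, 2, 4)
X <- c(1, 2, 4, 7)   # 3 occurs in v but not in X; 7 occurs in X but not in v

# Positions in v whose value is one of the values in X
keep <- which(v %in% X)

# Group those positions by the value found there.
# The list names are the matched values, as character strings.
mylist <- split(keep, v[keep])

mylist[["4"]]   # positions in v where 4 appears
```

This scans v once instead of once per element of X, which is where the loop spends its time.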
And that’s it! As a check, let’s look at the first 3 entries of mylist:
There’s a gap at 3, as 3 doesn’t appear in X (even though it occurs in v). The numbers listed against 4 are the index points in v where 4 appears:
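The check can be sketched like this, again on made-up toy data and assuming mylist was built with split() as a list named by the matched values:

```r
# Toy data (hypothetical): v stands in for dat.fram[, 3]
v <- c(4, 2, 3, 4, 1, 2, 4)
X <- c(1, 2, 4, 7)

keep   <- which(v %in% X)
mylist <- split(keep, v[keep])

names(mylist)    # there is no "3" entry, because 3 is not in X

# The entry for 4 should agree with a direct lookup in v
direct <- which(v == 4)
same   <- identical(mylist[["4"]], direct)
same
```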
Finally, it’s worth noting that values that appear in X but not in v won’t have an entry in the list. Looking one of them up by name returns NULL, which is presumably what you want anyway!
Extra note: You can use the code below to create an NA entry for each member of X not in v…
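A sketch of that step, assuming X is numeric and mylist is named by the matched values (the toy data and the helper name missing_vals are made up for illustration):

```r
# Toy data (hypothetical): v stands in for dat.fram[, 3]
v <- c(4, 2, 3, 4, 1, 2, 4)
X <- c(1, 2, 4, 7)

keep   <- which(v %in% X)
mylist <- split(keep, v[keep])

# Values of X that never occur in v (here: 7)
missing_vals <- setdiff(unique(X), as.numeric(names(mylist)))

# One NA entry per missing value, named by that value
mylist_extras <- setNames(as.list(rep(NA, length(missing_vals))), missing_vals)

# Merge, then reorder so the names are in numeric order
mylist_all <- c(mylist, mylist_extras)
mylist_all <- mylist_all[order(as.numeric(names(mylist_all)))]
```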
Fairly self-explanatory: mylist_extras is a list holding the additional entries you need (its names are the values of X that don’t feature in names(mylist), and each entry is simply NA). The final two lines first merge mylist and mylist_extras, and then reorder the result so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.
Cheers! 🙂
ORIGINAL POST BELOW… superseded by the above, obviously!
Here’s a toy example with tapply that might well run significantly quicker… I made X and d relatively small so you could see what’s going on:
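A minimal sketch along those lines, assuming d is a small stand-in for dat.fram with the values of interest in its third column (the column names and values are made up for illustration):

```r
# Toy data (hypothetical): d stands in for dat.fram
d <- data.frame(a   = letters[1:7],
                b   = 7:1,
                val = c(4, 2, 3, 4, 1, 2, 4))
X <- c(1, 2, 4, 7)

# Group the row numbers of d by the value in column 3.
# simplify = FALSE keeps every group as a vector inside a list-like result.
idx <- tapply(seq_len(nrow(d)), d[, 3], identity, simplify = FALSE)

# Keep only the groups whose value is in X
present   <- intersect(as.character(X), names(idx))
match.ind <- idx[present]

match.ind[["4"]]   # rows of d where column 3 equals 4
```

Unlike the split() route, this groups every value in column 3 first and filters by X afterwards, so it does some extra work when X covers only a small part of the column.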