I have a very large dataframe that I’d like to save a subset of based on a certain subset of a vector. In short I have something like this:
> id<-c("ID1","ID2","ID2","ID3","ID4","ID4","ID4","ID4","ID4")
> status<-c("flag","flag","none","none","flag","flag","flag","none","flag")
> misc1ofmany<-c("etc1","etc2","etc3","etc4","etc5","etc6","etc7","etc8","etc9")
> df = data.frame(id, status, misc1ofmany) ; df
id status misc1ofmany
1 ID1 flag etc1
2 ID2 flag etc2
3 ID2 none etc3
4 ID3 none etc4
5 ID4 flag etc5
6 ID4 flag etc6
7 ID4 flag etc7
8 ID4 none etc8
9 ID4 flag etc9
I’d like to have all the rows of IDs that have been flagged, inclusive of their non-flagged sessions. Right now I’m trying to get the index of the other IDs by grep and plugging that back into a new df. Actually as I wrote this out I figured out that grepl might be easier to work with:
> flaggedIDs <- unique(as.vector(df$id[grep("flag",df$status)]))
> flaggedIDs.allStats.Index <- mapply(grepl,df$id,MoreArgs=list(x=flaggedIDs))
> flaggedIDs.allStats.Index
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
however I just want to get to here:
> flaggedIDsdf <- df[flaggedIDs.allStats.Index] ; flaggedIDsdf
id status misc1ofmany
1 ID1 flag etc1
2 ID2 flag etc2
3 ID2 none etc3
4 ID4 flag etc5
5 ID4 flag etc6
6 ID4 flag etc7
7 ID4 none etc8
8 ID4 flag etc9
I feel like this should be simpler than I am making this out to be, however I’ve tried a number of possibilities to figure this out to no avail. writing out the question helped this become a clearer/simpler problem in my mind (it looks like I’m just missing one step now), but now I’m also wondering if there is a more efficient way of going about this.
data.tableis really useful here for it’s elegant syntax and memory efficiencyOr even more compactly
These two approaches above use the fact that the
icomponent is evaluated first bydata.table.Keying by
idmeans we can do a self-join to extract only thoseidswherestatus=='flag.Or, using
byThis goes through the subsets of
DTbyid, and returns.SD(the data.table of the subset) ifany(status=='flag')(within that subset).