I am trying to find an efficient way of flagging the first and last row of each group.
R) ex=data.table(state=c("az","fl","fl","fl","fl","fl","oh"),city=c("TU","MI","MI","MI","MI","MI","MI"),code=c(85730,33133,33133,33133,33146,33146,45056))
R) ex
state city code
1: az TU 85730
2: fl MI 33133
3: fl MI 33133
4: fl MI 33133
5: fl MI 33146
6: fl MI 33146
7: oh MI 45056
I would like to flag the first and last row for each grouping variable, like this:
R) ex
state city code first.state last.state first.city last.city first.code last.code
1: az TU 85730 1 1 1 1 1 1
2: fl MI 33133 1 0 1 0 1 0
3: fl MI 33133 0 0 0 0 0 0
4: fl MI 33133 0 0 0 0 0 1
5: fl MI 33146 0 0 0 0 1 0
6: fl MI 33146 0 1 0 1 0 1
7: oh MI 45056 1 1 1 1 1 1
As far as I know, data.table cannot easily help with this, because by="state,city,code" would only look at the 4 distinct (state, city, code) triplets.
The only way I know is to compute first/last.code with by="state,city,code", then first/last.city with by="state,city", then first/last.state with by="state".
This is what I meant:
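# Apply firstLast cumulatively: by[1], then by[1:2], then by[1:3], ...
applyAll <- function(DT, by){
  f <- function(n, vec){ return(vec[1:n]) }
  by <- lapply(seq_along(by), FUN=f, by)
  out <- Reduce(f=firstLast, init=DT, x=by)
  return(out)
}
# Flag the first and last row of each group defined by `by`.
# The new columns are named after the last grouping variable;
# rows that are not flagged are left as NA.
firstLast <- function(DT, by){
  addNames <- paste(c("first", "last"), by[length(by)], sep=".")
  DT[DT[,list(IDX=.I[1]), by=by]$IDX, addNames[1]:=1]
  DT[DT[,list(IDX=.I[.N]), by=by]$IDX, addNames[2]:=1]
  return(DT)
}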
The result is obtained with applyAll(ex, c("state","city","code")), but this would make NUMEROUS copies of DT. My question, then: is there something planned or already existing that lets us get first/last by group? (This is fairly vanilla in SAS, kdb or SQL.)
In SAS:
data DT;
set ex;
by state city code;
if first.code then firstcode=1;
if last.code then lastcode=1;
if first.city then firstcity=1;
if last.city then lastcity=1;
if first.state then firststate=1;
if last.state then laststate=1;
run;
If this is the question:
then how about:
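As a rough sketch only, and not necessarily the code originally posted here, one way to build the flags with c() and rep() inside each group is shown below. It assumes the grouping keys are cumulative (state; then state, city; then state, city, code); the names cols, grp and flags are illustrative.

```r
library(data.table)

ex <- data.table(
  state = c("az", "fl", "fl", "fl", "fl", "fl", "oh"),
  city  = c("TU", "MI", "MI", "MI", "MI", "MI", "MI"),
  code  = c(85730, 33133, 33133, 33133, 33146, 33146, 45056)
)

cols <- c("state", "city", "code")   # assumed grouping hierarchy
for (k in seq_along(cols)) {
  grp   <- cols[seq_len(k)]                               # cumulative grouping key
  flags <- paste(c("first", "last"), cols[k], sep = ".")  # e.g. first.state, last.state
  # Within each group: a 1 followed by zeros (first), zeros followed by a 1 (last).
  # For a single-row group both vectors reduce to 1.
  ex[, (flags) := list(c(1L, rep(0L, .N - 1L)),
                       c(rep(0L, .N - 1L), 1L)),
     by = grp]
}
print(ex)
```

This builds two small vectors per group, which is the cost the .I/.N variant is meant to avoid.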
But as @Roland commented, there’s probably a better way to achieve your ultimate goal.
And, as requested, here’s what should be a faster solution using .I and .N:
It should be faster because the grouping is done just once per column, and lots of small vectors are not created (no call to c() or rep() for each group), unlike the first solution.
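A minimal sketch of the .I/.N idea, under the same assumption of cumulative grouping keys and not necessarily the code originally posted: one grouping pass per column collects the row numbers of the first and last row of every group, and the flag columns are then filled by reference. The names cols, grp, idx, first and last are illustrative.

```r
library(data.table)

ex <- data.table(
  state = c("az", "fl", "fl", "fl", "fl", "fl", "oh"),
  city  = c("TU", "MI", "MI", "MI", "MI", "MI", "MI"),
  code  = c(85730, 33133, 33133, 33133, 33146, 33146, 45056)
)

cols <- c("state", "city", "code")   # assumed grouping hierarchy
for (k in seq_along(cols)) {
  grp   <- cols[seq_len(k)]          # cumulative grouping key
  first <- paste0("first.", cols[k])
  last  <- paste0("last.", cols[k])
  # Single grouping pass: .I[1L] and .I[.N] are the row numbers (in ex)
  # of the first and last row of each group.
  idx <- ex[, list(f = .I[1L], l = .I[.N]), by = grp]
  # Initialise both flag columns to 0, then set the flagged rows to 1,
  # all by reference (no copy of ex is made).
  ex[, (first) := 0L]
  ex[, (last) := 0L]
  ex[idx$f, (first) := 1L]
  ex[idx$l, (last) := 1L]
}
print(ex)
```

Because := assigns by reference, the loop adds columns to ex in place rather than copying the table on each pass.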