I’m trying to rewrite a function that I have been using for a while. Simplified, it looks like this:
dat = data.table(dataframe)
getRecentRow <- function(data) {
  # Get most recent row (with highest Time)
  row = data[order(-Time)][1]
  return(row)
}
# Run getRecentRow on each chunk given an ID
output = dat[,getRecentRow(.SD), by=ID]
This function gives me the most recent entry (the one with the highest Time) per ID. However, each ID can have multiple entries, which are distinguished by a SUBID. I would like to go one level deeper: instead of the most recent entry per ID, I want the most recent entry per SUBID. Since SUBIDs are not unique across IDs, the ID also has to be taken into account. So I want the most recent entry per combination of ID and SUBID.
Summarizing: the input to getRecentRow() should be subset not by ID alone, but by ID and SUBID.
I tried:
dat = data.table(dataframe)
getRecentRow <- function(data) {
  # Get most recent row (with highest Time)
  row = data[order(-Time)][1]
  return(row)
}
# Run getRecentRow on each chunk given an ID and SUBID
output = dat[,getRecentRow(.SD), by=list(ID, SUBID)]
But this returns incorrect output, with more rows than required. I think it should be an easy fix by reformulating by=list(ID, SUBID), but I can’t figure out how.
The problem was not in the function, which was doing its job the whole time. The problem was with the input: the ID number sometimes took a very large value, which for some reason caused the split into groups to fail. After converting this number to character, the problem was solved and the function worked as expected.
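For anyone running into the same thing, here is a minimal sketch of the fix described above. The example data and its column values are made up for illustration, and using sprintf("%.0f", ...) for the conversion is my own assumption (the original fix only says "converting to character"); I chose it because as.character() can switch to scientific notation for large doubles, which could make distinct IDs collapse into the same string.

```r
library(data.table)

# Hypothetical example data: large ID values stored as doubles
dataframe <- data.frame(
  ID    = c(1e15 + 1, 1e15 + 1, 1e15 + 1, 1e15 + 2, 1e15 + 2),
  SUBID = c("a", "a", "b", "a", "a"),
  Time  = c(1, 3, 2, 5, 4)
)
dat <- as.data.table(dataframe)

# Convert ID to character before grouping, so groups are formed by
# exact string comparison instead of floating-point comparison.
# Assumption: IDs are whole numbers, so "%.0f" prints them in full.
dat[, ID := sprintf("%.0f", ID)]

# Most recent row (highest Time) per combination of ID and SUBID
output <- dat[, .SD[which.max(Time)], by = list(ID, SUBID)]
print(output)
```

The .SD[which.max(Time)] form is just a shorter way to pick the row with the highest Time within each group; the original getRecentRow(.SD) call should work equally well in its place.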