I’m trying to run apply a function to each row of a dataset. The function looks up matching rows in a second dataset and computes a similarity score for the product details passed to it.
The function works if I just call it with test numbers but I can’t figure out how to run it on all rows of my dataset. I’ve tried using apply but can’t get it working.
I’m going to be iterating different parameter settings to find those that best fit historical data so speed is important… meaning that a loop is out. Any help you can provide would be hugely appreciated.
Thanks! Alan
GetDistanceTest <- function(SnapshotDate, Cand_Type, Cand_Height, Cand_Age) {
HeightParam <- 1/5000
AgeParam <- 1
Stock_SameType <- HistoricalStock[!is.na(HistoricalStock$date) & !is.na(HistoricalStock$Type) & as.character(HistoricalStock$date)==as.character(SnapshotDate) & HistoricalStock$Type==Cand_Type,]
Stock_SameType$ED <- (HeightParam*(Stock_SameType$Height - Cand_Height))^2 + (AgeParam*(Stock_SameType$Age - Cand_Age))^2
return(sqrt(sum(Stock_SameType$ED)))
}
HistoricalStock <- HistoricalAQStock[,c(1, 3, 4, 5)]
colnames(HistoricalStock) <- c("date", "Age", "Height", "Type")
Sales <- AllSales[,c(2,10,11,25)]
colnames(Sales) <- c("date", "Age", "Height", "Type")
GetDistanceTest("2010-04-01", 5261, 12, 7523) #works and returns a single number
res1 <- transform(Sales, ClusterScore=GetDistanceTest(date, Type, Height, Age))
# returns Error in `$<-.data.frame`(`*tmp*`, "ED", value = c(419776714.528591, 22321257.0276852, : replacement has 4060 rows, data has 54
# also 4 warnings, one for each variable. e.g. 1: In as.character(HistoricalStock$date) == as.character(SnapshotDate) : longer object length is not a multiple of shorter object length
res2 <- apply(Sales, 1, GetDistanceTest, Sales$Type, Sales$Height, Sales$Age)
# `$<-.data.frame`(`*tmp*`, "ED", value = c(419648071.041523, 22325941.2704261, : replacement has 4060 rows, data has 13
# also same 4 warnings as res1
I took some liberties with your code b/c I try to vectorize vice use loops whenever I can… With the
mergefunction, you merge the two data frames, and operate on the “columns”, which allows you to use the vectorization built into R. I think this will do what you want (in the second line I’m just making sure thatAandBdon’t have the same values forheightandageso that your distance isn’t always zero):