I have to work with two data frames of about 2 million records each. I used a for loop to look up values from one in the other, but it is too slow. I’ve created an example to demonstrate what I need to do.
ratings = data.frame(id = c(1,2,2,3,3),
                     rating = c(1,2,3,4,5),
                     timestamp = c("2006-11-07 15:33:57","2007-04-22 09:09:16","2010-07-16 19:47:45","2010-07-16 19:47:45","2006-10-29 04:49:05"))
stats = data.frame(primeid = c(1,1,1,2),
                   period = c(1,2,3,4),
                   user = c(1,1,2,3),
                   id = c(1,2,3,2),
                   timestamp = c("2011-07-01 00:00:00","2011-07-01 00:00:00","2011-07-01 00:00:00","2011-07-01 00:00:00"))
ratings$timestamp = strptime(ratings$timestamp, "%Y-%m-%d %H:%M:%S")
stats$timestamp = strptime(stats$timestamp, "%Y-%m-%d %H:%M:%S")
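stats$idrating = NA_real_
for (i in seq_len(nrow(stats)))
{
  cat("Processing ", i, " ...\r\n")
  # all ratings for this row's id
  temp = ratings[ratings$id == stats$id[i], ]
  # highest rating recorded before this row's timestamp
  stats$idrating[i] = max(temp$rating[temp$timestamp < stats$timestamp[i]])
}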
Can someone suggest an alternative to this? I know apply may work, but I have no idea how to translate the for loop.
UPDATE: Thank you for the help. I am providing more information.
The table stats has unique combinations of primeid, period, user, id.
The table ratings has multiple records per id, with different ratings and timestamps.
What I want to do is the following: for each id found in stats, find all the matching records in the ratings table (by the id column), then get the max rating among those recorded before a specific timestamp, which also comes from stats.
From a data structure perspective, it seems that you want to merge the two tables and then perform a split-group-apply (split-apply-combine) step.
Instead of looping to check which rows match which, you can simply merge the two tables (much like a JOIN statement in SQL) and then use a ‘ddply’ type of method, which takes a data frame, splits it by group, and returns a data frame. I recommend you install the ‘plyr’ package.
If the use of plyr confuses you, please visit this tutorial: http://www.creatapreneur.com/2013/01/split-group-apply/.
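The merge-then-group idea can also be sketched in base R, without plyr. This is only a minimal sketch on the toy data from the question, not a tuned solution: the names merged, best and result and the .stats/.rating suffixes are made up for illustration, and the timestamps are parsed with as.POSIXct (UTC assumed) instead of strptime so they sit in the data frames as plain vectors.

```r
# Toy data from the question, with timestamps stored as POSIXct (UTC is an assumption)
ratings <- data.frame(
  id = c(1, 2, 2, 3, 3),
  rating = c(1, 2, 3, 4, 5),
  timestamp = as.POSIXct(c("2006-11-07 15:33:57", "2007-04-22 09:09:16",
                           "2010-07-16 19:47:45", "2010-07-16 19:47:45",
                           "2006-10-29 04:49:05"), tz = "UTC")
)
stats <- data.frame(
  primeid = c(1, 1, 1, 2),
  period = c(1, 2, 3, 4),
  user = c(1, 1, 2, 3),
  id = c(1, 2, 3, 2),
  timestamp = as.POSIXct(rep("2011-07-01 00:00:00", 4), tz = "UTC")
)

# Join on id (like an SQL inner join); the two timestamp columns get suffixes
merged <- merge(stats, ratings, by = "id", suffixes = c(".stats", ".rating"))

# Keep only ratings recorded before the timestamp of the stats row
merged <- merged[merged$timestamp.rating < merged$timestamp.stats, ]

# Max rating per unique stats row (primeid, period, user, id)
best <- aggregate(rating ~ primeid + period + user + id, data = merged, FUN = max)
names(best)[names(best) == "rating"] <- "idrating"

# Left join back so stats rows with no earlier rating get NA instead of being dropped
result <- merge(stats, best, by = c("primeid", "period", "user", "id"), all.x = TRUE)
print(result)
```

On the toy data this gives idrating values of 1, 3, 5 and 3 for periods 1 to 4, the same as the loop. Note that merged holds one row per matching (stats row, rating) pair, so with 2 million rows on each side it can get large if ids repeat often; in that case it may be worth processing the ids in chunks.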