R Version 2.11.1 32-bit on Windows 7
I have the data file train.txt, shown below:
USER_A USER_B ACTION
1 7 0
1 8 1
2 6 2
2 7 1
3 8 2
I process the data with the algorithm below:
train_data = read.table("train.txt", header = T)
result = matrix(0, length(unique(train_data$USER_B)), 2)
result[, 1] = unique(train_data$USER_B)
for (i in 1:dim(result)[1])
{
  temp = train_data[train_data$USER_B %in% result[i, 1], ]
  result[i, 2] = sum(temp[, 3]) / dim(temp)[1]
}
The result is the score of every USER_B in train_data. The score is defined as:
score of USER_B = (the sum of all the ACTION values for USER_B) / (the number of times USER_B was recommended)
However, train_data is very large, and this program may take three days to finish. Can this algorithm be improved?
Running your example, the result you want is the mean ACTION for each unique USER_B (0.5 for user 7, 1.5 for user 8, and 2 for user 6).
You can do this with one line of code using the ddply() function in package plyr. Alternatively, the function tapply in base R does the same.
Depending on the size of your table, you can get an improvement in execution time of 20x or higher. In a system.time test on a data.frame with a million entries, your algorithm takes 116 seconds, ddply() takes 5.4 seconds, and tapply takes 1.2 seconds.
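As a minimal sketch of both approaches on the sample data from the question: the data frame is built inline rather than read from train.txt, the column name score and the synthetic table big are my own choices for illustration, and the plyr part only runs if that package is installed.

```r
# Sample data from the question, built inline
# (assumption: same columns as train.txt)
train_data <- data.frame(
  USER_A = c(1, 1, 2, 2, 3),
  USER_B = c(7, 8, 6, 7, 8),
  ACTION = c(0, 1, 2, 1, 2)
)

# Base R: mean ACTION for each USER_B.
# The result is named by USER_B, with names in sorted order.
scores <- tapply(train_data$ACTION, train_data$USER_B, mean)
print(scores)
#   6   7   8
# 2.0 0.5 1.5

# plyr: the same calculation, returned as a data frame.
# Skipped when plyr is not installed.
if (suppressWarnings(require(plyr, quietly = TRUE))) {
  scores_df <- ddply(train_data, .(USER_B), summarise, score = mean(ACTION))
  print(scores_df)
}

# Timing sketch on synthetic data (sizes are arbitrary, not the
# million-row table from the test described above)
set.seed(1)
big <- data.frame(
  USER_B = sample(1:1000, 1e5, replace = TRUE),
  ACTION = sample(0:2, 1e5, replace = TRUE)
)
print(system.time(tapply(big$ACTION, big$USER_B, mean)))
```

Note that tapply returns the scores sorted by USER_B, whereas the loop in the question keeps them in order of first appearance; index the result by name (for example scores["7"]) if the order matters.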