I’ve been looking at building a ‘people who like x, also like y’ type recommendation system, and was looking at using Vogoo, but after looking through their code it seems there is a lot of nearest neighbor based on ratings.
Over the last few weeks I’ve seen a few articles stating that most people either don’t rate at all, or rate a 5 http://youtube-global.blogspot.com/2009/09/five-stars-dominate-ratings.html
I don’t currently have a ratings system implemented, and I don’t really see the need to implement it if all the applicable ratings don’t fluctuate.
Does this mean that KNN isn’t really valuable?
Does anybody have any recommendations for developing a system to get recommendations of similar likeness based on past viewing history (passive filtering)?
The data I’m working with is event based, so if you’ve looked at mens doubles-tennis, blue jays baseball, college womens basket ball, etc. I’d recommend other events that are currently in your area which others who looked at similar events across the entire system have also viewed.
I mostly work with PHP, but have been starting to learn Python (and probably need to learn Java, if that helps).
Well, the curt answer to your first question would be no. If you have no variation in your data (YouTube stars), it’s difficult to make a recommendation.
What I might suggest is trying to expand the amount of data you have. For the YouTube example, instead of just looking at the star ratings, also consider the percentage of the video that was watched. Lots of pausing, seeking, rewinding might mean that the user liked the video and wanted to see parts more often, so it should get a boost from that.
The standard way of doing recommendation, at least in the music world, is to come up with a distance metric that you can use, which gives you a distance between any two pieces of music. Then when you find out the type of music a user likes, you can pick one that’s similar to their tastes by picking songs that are “close” according to the distance metric. They are also called similarity matrices, where two items with high distance would have low similarity.
So the question comes down to how you generate these similarities. One way you could do it would be to count how many people that watched show A also watched show B. If you do this for every pair of events, you’ll be able to make recommendations from the corpus you’ve analyzed. Unfortunately, this doesn’t extend well to making recommendations for events where you don’t already know how many people watched them (live events instead of recorded ones).
This is at least a start though.