Let’s say you’re running a movie database website like IMDb or Netflix, and users rate each movie from 1 to 10 stars. When a user rates a movie, I get the movie’s id (long) and a rating from 1 to 10 in the request. The Movie class looks like this:
class Movie
{
    long id;
    String name;
    double avgRating; // avg rating of this movie
    long numberOfRatings; // how many times this movie was rated
}
public void updateRating(long movieId, int rating)
{
    // code to update movie rating and update top 10 movies to show on page
}
My question is: what data structures can I use to keep a huge number of movies in memory so that each updateRating call updates the movie’s rating and also updates the top 10 movies shown on the web page, so that users always see the latest top 10? I have a lot of space on the web server and can keep all the movie objects in memory. The challenges here are:
1) Look up a movie by id.
2) Update the movie’s rating.
3) Choose the new position of this movie in the collection of movies sorted by rating, and if its new position is within the top 10, show it on the web page.
All these operations should run in the best possible time.
This is not homework, just a general programming and data structures question.
It seems like there are two parallel structures here. First, you need a lookup table that can map from IDs to movies. Second, you need to maintain some sort of priority queue that can be used to track the top ten movies overall.
One way to solve this problem would be to simply maintain these two structures concurrently. Since you know that each movie has an integral ID, you could store the movies in a giant array or, if you expect the IDs to be sparse, in a hash table. Additionally, you could maintain a priority queue (perhaps backed by a binary or binomial heap) that stores all movies with priority equal to their average rating. This would allow you to determine the top ten movies by dequeuing ten elements from the priority queue and then reinserting them.
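As a rough sketch of this simpler two-structure approach in Java: the names SimpleMovieIndex, addMovie and topTen, the Movie constructor, and the tie-break on ID are my own assumptions for illustration, and the running-average update is just one plausible reading of what updateRating should do. Note that java.util.PriorityQueue has no decrease-key operation, so the sketch removes and re-adds the movie instead, and its remove(Object) is a linear scan rather than O(lg n).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class Movie {
    final long id;
    final String name;
    double avgRating;      // running average of all ratings received so far
    long numberOfRatings;  // how many times this movie was rated

    Movie(long id, String name) {  // hypothetical constructor
        this.id = id;
        this.name = name;
    }
}

class SimpleMovieIndex {
    // Lookup table: movie ID -> movie (a hash table, since IDs may be sparse).
    private final Map<Long, Movie> byId = new HashMap<>();

    // Max-heap over ALL movies: highest average first, ties broken by lower ID
    // (the tie-break is an assumption; the question does not specify one).
    private final PriorityQueue<Movie> byRating = new PriorityQueue<>(
            Comparator.comparingDouble((Movie m) -> m.avgRating).reversed()
                      .thenComparingLong(m -> m.id));

    void addMovie(Movie movie) {
        byId.put(movie.id, movie);
        byRating.add(movie);
    }

    void updateRating(long movieId, int rating) {
        Movie movie = byId.get(movieId);
        if (movie == null) {
            throw new IllegalArgumentException("Unknown movie ID: " + movieId);
        }
        // Remove BEFORE changing the key so the heap never holds a stale order.
        // PriorityQueue.remove(Object) is linear; a heap with a real change-key
        // operation would do this step in O(lg n).
        byRating.remove(movie);
        movie.avgRating = (movie.avgRating * movie.numberOfRatings + rating)
                / (movie.numberOfRatings + 1);
        movie.numberOfRatings++;
        byRating.add(movie);
    }

    // Dequeue up to ten movies, then reinsert them so the heap is unchanged.
    List<Movie> topTen() {
        List<Movie> top = new ArrayList<>();
        while (top.size() < 10 && !byRating.isEmpty()) {
            top.add(byRating.poll());
        }
        byRating.addAll(top);
        return top;
    }
}
```

Every call to topTen() here pays for ten dequeues and ten reinserts, which is the cost that keeping a separate top-ten array avoids.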
However, to squeeze more performance out of your priority queue, I’d suggest using a slightly modified queue structure in which you have an array of the top ten movies in sorted order and a priority queue of all other movies that are not in the top ten. Whenever you update the priority of a movie, you could do the following:
1) If the movie is in the top-ten array, remove it from that array and shift the elements after it up one spot. Then insert it into the priority queue with its new rating.
2) Otherwise, use the priority queue’s change-key operation (increase-key or decrease-key, depending on which way the average moved) to update its key. If its rating is now higher than that of the tenth movie in the top-ten array, remove that tenth movie from the array and insert it into the priority queue. Otherwise, we are done and can skip step 3.
(At this point, the updated movie is in the priority queue at its proper location, and the top-ten array has nine elements in it.)
3) Use the priority queue’s dequeue-max function to extract the most popular movie from the priority queue, then use a simple insertion sort step to insert it into the top-ten array.
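The steps above can be sketched roughly as follows in Java. This is only an illustration under some assumptions: TopTenTracker, addMovie, getTopTen, settle and insertSorted are names I made up, the tie-break on ID is my own choice, and I have swapped the heap for a java.util.TreeSet because the standard PriorityQueue has no change-key operation. Removing and re-adding in the TreeSet plays the role of change-key, and pollFirst plays the role of dequeue-max, both in O(lg n).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class Movie {
    final long id;
    final String name;
    double avgRating;      // running average of all ratings received so far
    long numberOfRatings;  // how many times this movie was rated

    Movie(long id, String name) {  // hypothetical constructor
        this.id = id;
        this.name = name;
    }
}

class TopTenTracker {
    private static final int K = 10;

    // Higher average first; ties broken by ID so no two movies compare equal.
    private static final Comparator<Movie> BEST_FIRST =
            Comparator.comparingDouble((Movie m) -> m.avgRating).reversed()
                      .thenComparingLong(m -> m.id);

    private final Map<Long, Movie> byId = new HashMap<>();           // ID lookup
    private final List<Movie> topTen = new ArrayList<>();            // sorted, best first
    private final TreeSet<Movie> rest = new TreeSet<>(BEST_FIRST);   // everything else

    void addMovie(Movie movie) {
        if (byId.putIfAbsent(movie.id, movie) != null) {
            throw new IllegalArgumentException("Duplicate movie ID: " + movie.id);
        }
        settle(movie, false);
    }

    void updateRating(long movieId, int rating) {
        Movie movie = byId.get(movieId);
        if (movie == null) {
            throw new IllegalArgumentException("Unknown movie ID: " + movieId);
        }
        // Take the movie out of whichever structure holds it BEFORE its key
        // changes; the TreeSet could not find it again under a stale key.
        boolean wasInTopTen = topTen.remove(movie);  // O(k) scan and shift
        if (!wasInTopTen) {
            rest.remove(movie);
        }
        movie.avgRating = (movie.avgRating * movie.numberOfRatings + rating)
                / (movie.numberOfRatings + 1);
        movie.numberOfRatings++;
        settle(movie, wasInTopTen);
    }

    // Returns a copy so callers cannot disturb the sorted array.
    List<Movie> getTopTen() {
        return new ArrayList<>(topTen);
    }

    private void settle(Movie movie, boolean wasInTopTen) {
        rest.add(movie);  // steps 1 and 2: the movie re-enters with its new key
        if (!wasInTopTen && topTen.size() == K) {
            Movie tenth = topTen.get(K - 1);
            if (BEST_FIRST.compare(movie, tenth) > 0) {
                return;  // still ranks below the tenth movie: we are done
            }
            topTen.remove(K - 1);  // demote the tenth movie
            rest.add(tenth);
        }
        insertSorted(rest.pollFirst());  // step 3: promote the best of the rest
    }

    // One insertion-sort step: walk back from the end to the right slot.
    private void insertSorted(Movie movie) {
        int i = topTen.size();
        while (i > 0 && BEST_FIRST.compare(movie, topTen.get(i - 1)) < 0) {
            i--;
        }
        topTen.add(i, movie);
    }
}
```

With this layout, the page can read the current top ten straight from getTopTen() without touching the larger structure at all.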
The overall time complexity for this approach (assuming you use a binary or binomial heap) is O(k^2 + lg n), where k is the number of elements in the top-ten list and n is the total number of movies. On average, it runs in O(lg n) time, since chances are you don’t need to update the top ten list. In either case, since k is small (ten), I’d assume that this would work very quickly. Moreover, it gives you O(1) lookup for any of the top k movies, which I expect will be a pretty common operation.
Hope this helps!