I am trying to figure out an optimal solution (in Java) to the following problem:
In a first pass over some data, I count the number of occurrences of each item. I create a HashMap from item ID to integer and increment the integer every time I see an occurrence of the item, so I end up with a Map<Long,Integer> from itemID to count.
Now, what I need from this map is the top n item ids sorted by the count.
Apparently HashMap is not the optimal data structure here. Any ideas?
This is for some data mining stuff I am doing at work, so not a hw problem…
Actually, a HashMap is a reasonable solution here because you have to accumulate the totals. There’s no way you can shortcut that, and no simple way to find the top N items until you know the counts for all items.
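For reference, here is a minimal sketch of that counting pass. The class and method names (ItemCounter, countOccurrences) and the long[] input are made up for illustration; your data will probably arrive in some other form.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not from the original question.
class ItemCounter {
    // Counts how many times each item ID occurs in the input.
    static Map<Long, Integer> countOccurrences(long[] itemIds) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long id : itemIds) {
            // Inserts 1 for a new ID, otherwise adds 1 to the existing count.
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }
}
```

Map.merge needs Java 8 or later; on older versions you would do the get, null check, and put by hand.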
After you have the HashMap, there are several ways to do things. If the data is relatively small, create an array of itemId and count pairs, and sort by count in descending order. Then select the top N items.
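A rough sketch of the sort-based approach, assuming the counts are already in a Map<Long,Integer> and Java 8 or later. The class name TopNBySort and the method topN are placeholders I picked for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the original question.
class TopNBySort {
    // Returns up to n item IDs, highest count first.
    static List<Long> topN(Map<Long, Integer> counts, int n) {
        // Copy the (itemId, count) pairs into a list so they can be sorted.
        List<Map.Entry<Long, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by count, descending.
        entries.sort(Map.Entry.<Long, Integer>comparingByValue().reversed());

        List<Long> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

With counts {1=5, 2=3, 3=9} and n = 2 this returns [3, 1]. Items with equal counts come back in no particular order, since HashMap iteration order is unspecified.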
If you have lots of items (in the hundreds of thousands), it’s probably faster to use a min heap after you get the counts. The idea is that you put the first N items into the min heap, and after that you only insert an item if its count is larger than the smallest count in the heap, removing that smallest item first so the heap never holds more than N items. When you’re done, the heap contains the top N items.
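Here is one way the min heap idea could look, using java.util.PriorityQueue (which is a min heap with respect to its comparator). Treat it as a sketch: TopNByHeap and topN are names I made up, and it assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not from the original question.
class TopNByHeap {
    // Returns up to n item IDs, highest count first.
    static List<Long> topN(Map<Long, Integer> counts, int n) {
        List<Long> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min heap ordered by count: the head is the smallest count kept so far.
        PriorityQueue<Map.Entry<Long, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<Long, Integer>comparingByValue());

        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            if (heap.size() < n) {
                heap.offer(e);              // still filling the first n slots
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();                // evict the current smallest
                heap.offer(e);              // keep the larger one instead
            }
        }

        // The heap yields smallest first, so collect and then reverse.
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }
}
```

The heap never grows past n entries, so each insert costs O(log n) instead of the O(log m) per element you pay when sorting all m items. The sketch holds on to the map's own entries, so don't modify the map while it runs.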
You could keep things in order by count while you’re adding things up, but every time you increment a counter you’d have to remove that item from the sorted collection and re-insert it under its new count. You’re better off accumulating things in a HashMap where it’s easy to look things up by ID, and then post-processing to apply ordering by count.