In Java, I need an algorithm to find the maximum number of occurrences in a collection of integers. For example, if my set is [2,4,3,2,2,1,4,2,2], the algorithm needs to output 5 because 2 is the most frequently occurring integer and it appears 5 times. Think of it as finding the peak of the histogram of the set of integers.
The challenge is that I have to do this repeatedly, for multiple sets of many integers, so it needs to be efficient. Also, I do not know in advance which element will appear most often in each set; it is completely random.
I thought about putting the values of the set into an array, sorting it, and then iterating over the array, counting consecutive appearances of each number and tracking the maximum count, but I am guessing it will take a long time. Are there any libraries or algorithms that could help me do it efficiently?
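I would loop over the collection, inserting into a Map data structure with the following logic: if the integer is not yet in the Map, insert it as a key with a value of 1; if it is already there, increment its value. The largest value seen along the way is the answer.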
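A minimal sketch of that counting loop, assuming Java 9 or later (for List.of) and a hypothetical class name MaxOccurrences. It tracks the running maximum while counting, so no second pass over the Map is needed:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MaxOccurrences {

    // Returns the highest number of times any single value appears in the input.
    static int maxOccurrences(Iterable<Integer> values) {
        Map<Integer, Integer> counts = new HashMap<>();
        int max = 0;
        for (Integer value : values) {
            // merge() stores 1 for a new key, otherwise adds 1 to the existing
            // count, and returns the updated count.
            int count = counts.merge(value, 1, Integer::sum);
            if (count > max) {
                max = count;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(2, 4, 3, 2, 2, 1, 4, 2, 2);
        System.out.println(maxOccurrences(input)); // prints 5
    }
}
```

Declaring the variable as Map rather than HashMap means the implementation can be swapped later by changing a single line.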
There are two Maps in Java you could use – HashMap and TreeMap – these are compared below:
HashMap vs. TreeMap
You can skip the detailed explanation and jump straight to the summary if you wish.
A HashMap is a Map which stores key-value pairs in an array. The index used for key k is:
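As a simplified illustration of the idea (not the exact formula java.util.HashMap uses), the index can be thought of as the key's hash code reduced modulo the array length. The method name indexFor is hypothetical. As far as I know, current OpenJDK versions instead mix the hash bits and mask them against a power-of-two table length, but the effect is the same: a hash code is folded onto a slot.

```java
class BucketIndexSketch {

    // Simplified illustration: maps a key's hash code onto an array slot.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int indexFor(Object key, int arrayLength) {
        return Math.floorMod(key.hashCode(), arrayLength);
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the int value itself, so 42 lands in slot 42 % 16 = 10.
        System.out.println(indexFor(42, 16)); // prints 10
        // 26 maps to slot 10 as well in a 16-slot array, which is a collision.
        System.out.println(indexFor(26, 16)); // prints 10
    }
}
```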
Sometimes two completely different keys will end up at the same index. To solve this, each location in the array is really a linked list, which means every lookup always has to loop over the linked list and check for equality using the k.equals(other) method. Worst case, all keys get stored at the same location and the HashMap becomes an unindexed list.
As the HashMap gains more entries, the likelihood of these clashes increases and the efficiency of the structure decreases. To solve this, when the number of entries reaches a critical point (determined by the loadFactor argument in the constructor), the structure is resized:
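The resize step can be sketched with a toy hash table of int keys. This is an illustration under simplifying assumptions, not the real java.util.HashMap code: the class ResizeSketch, its starting capacity of 4 and its doubling policy are all chosen here for demonstration.

```java
import java.util.ArrayList;
import java.util.List;

class ResizeSketch {

    private static final double LOAD_FACTOR = 0.75; // assumed threshold ratio

    private List<List<Integer>> buckets = newBuckets(4);
    private int size = 0;

    private static List<List<Integer>> newBuckets(int capacity) {
        List<List<Integer>> table = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) {
            table.add(new ArrayList<>());
        }
        return table;
    }

    private List<Integer> bucketFor(int key) {
        return buckets.get(Math.floorMod(Integer.hashCode(key), buckets.size()));
    }

    int capacity() {
        return buckets.size();
    }

    boolean contains(int key) {
        return bucketFor(key).contains(key);
    }

    void add(int key) {
        if (contains(key)) {
            return;
        }
        // Grow before the table gets too full.
        if (size + 1 > buckets.size() * LOAD_FACTOR) {
            resize();
        }
        bucketFor(key).add(key);
        size++;
    }

    // Allocate a table twice as large and re-insert every existing key.
    // Each key's slot has to be recomputed, so this step costs O(n).
    private void resize() {
        List<List<Integer>> old = buckets;
        buckets = newBuckets(old.size() * 2);
        for (List<Integer> bucket : old) {
            for (int key : bucket) {
                bucketFor(key).add(key);
            }
        }
    }

    public static void main(String[] args) {
        ResizeSketch table = new ResizeSketch();
        for (int i = 0; i < 10; i++) {
            table.add(i);
        }
        System.out.println(table.capacity()); // prints 16 (grew 4 -> 8 -> 16)
    }
}
```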
As you can see, this can become relatively expensive if there are many resizes.
This problem can be overcome if you can pre-allocate the HashMap at an appropriate size before you begin, e.g. map = new HashMap<>((int) (input.size() * 1.5)); note that the constructor takes an int, so the cast is required. For large datasets, this can dramatically reduce memory churn.
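One way to pick the initial capacity, sketched under the assumption that the default load factor of 0.75 is kept: divide the expected number of entries by the load factor and round up. The helper names capacityFor and presizedMap are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class PresizeSketch {

    // Smallest initial capacity that can hold expectedEntries without
    // exceeding the default load factor of 0.75.
    static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    static Map<Integer, Integer> presizedMap(int expectedEntries) {
        return new HashMap<>(capacityFor(expectedEntries));
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(1000)); // prints 1334
        Map<Integer, Integer> counts = presizedMap(1000);
        counts.merge(2, 1, Integer::sum);
        System.out.println(counts); // prints {2=1}
    }
}
```

For this problem the Map holds one entry per distinct integer, so input.size() is only an upper bound; if far fewer distinct values are expected, a smaller estimate wastes less memory.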
Because the keys are essentially randomly positioned in the HashMap, the key iterator will return them in an effectively arbitrary order. Java does provide the LinkedHashMap, which iterates in the order the keys were inserted.
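A small sketch of the LinkedHashMap behaviour, using the example input from the question (the class and method names are hypothetical). Updating the count of an existing key does not move it, so the keys come back in first-seen order:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InsertionOrderSketch {

    // Counts occurrences while remembering the order in which values were first seen.
    static List<Integer> keysInFirstSeenOrder(List<Integer> values) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (Integer value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        return new ArrayList<>(counts.keySet());
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(2, 4, 3, 2, 2, 1, 4, 2, 2);
        System.out.println(keysInFirstSeenOrder(input)); // prints [2, 4, 3, 1]
    }
}
```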
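Performance for a HashMap: insert and lookup are O(1) on average when the keys are well distributed, degrading towards O(n) if many keys clash at the same index; each resize costs O(n).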
On the other hand, a TreeMap stores entries in a balanced tree, a dynamic structure that is incrementally built up as key-value pairs are added. Insert cost depends on the depth of the tree (log(tree.size())), but is predictable: unlike HashMap, there are no pauses for resizing and no edge conditions where performance drops.
Each insert and lookup is more expensive than in a well-distributed HashMap.
Further, in order to insert the key in the tree, every key must be comparable to every other key, requiring the k.compareTo(other) method from the Comparable interface (or a Comparator passed to the TreeMap constructor). Given the question is about integers, and Integer implements Comparable, this is not a problem.
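The same counting loop with a TreeMap might look like the sketch below (class and method names are hypothetical). Nothing changes apart from the constructor, and as a side effect the keys come out in ascending order:

```java
import java.util.List;
import java.util.TreeMap;

class TreeMapCountSketch {

    // Counts occurrences; TreeMap keeps the keys sorted via Integer.compareTo.
    static TreeMap<Integer, Integer> countSorted(List<Integer> values) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (Integer value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(2, 4, 3, 2, 2, 1, 4, 2, 2);
        System.out.println(countSorted(input)); // prints {1=1, 2=5, 3=1, 4=2}
    }
}
```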
Performance for a TreeMap: insert and lookup are O(log n), with no resize pauses.
Summary
First thoughts: Dataset size:
In this specific case, a key factor is whether the expected number of unique integers is large or small compared to the overall dataset size.
One final point, if none of the considerations above is decisive:
However, if performance is important, the only way to decide is to code against the Map interface, then profile both the HashMap and the TreeMap to see which is actually better in your situation. Premature optimization is the root of much evil 🙂
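A rough sketch of such a comparison, assuming Java 8 or later. The counting method takes a Supplier so that either Map implementation can be plugged in. The test data (one million values drawn from 1000 distinct integers, fixed seed) is made up for illustration, and a single System.nanoTime() measurement without warm-up is only indicative; a proper harness such as JMH would give more trustworthy numbers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
import java.util.function.Supplier;

class MapComparisonSketch {

    // Written against the Map interface so either implementation can be used.
    static int maxOccurrences(int[] values, Supplier<Map<Integer, Integer>> mapFactory) {
        Map<Integer, Integer> counts = mapFactory.get();
        int max = 0;
        for (int value : values) {
            max = Math.max(max, counts.merge(value, 1, Integer::sum));
        }
        return max;
    }

    // Crude wall-clock timing of one run; not a substitute for a real benchmark.
    static long timeNanos(int[] values, Supplier<Map<Integer, Integer>> mapFactory) {
        long start = System.nanoTime();
        maxOccurrences(values, mapFactory);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Hypothetical test data; replace with something shaped like your real input.
        int[] values = new Random(42).ints(1_000_000, 0, 1_000).toArray();
        System.out.println("HashMap: " + timeNanos(values, HashMap::new) + " ns");
        System.out.println("TreeMap: " + timeNanos(values, TreeMap::new) + " ns");
    }
}
```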