Do you know of any time-efficient way to remove duplicate values from a very big integer array using Java? The size of the array depends on the logged-in user, but will always exceed 1500000 unsorted values with some duplicates. Every value is between 100000 and 9999999.
I tried converting it to a List, but the heap on my server doesn’t allow this amount of data (my ISP has restricted it). And a regular for loop nested within a for loop takes over 5 minutes to run.
The size of the array without the duplicates is the one I will store in my database.
Help would be appreciated!
You could perhaps use a bit set? I don’t know how efficient Java’s BitSet is. But values up to 9999999 need only 10000000 bits (indices 0 to 9999999), which is 10000000 / 8 = 1250000 bytes, or about 1.2 MB. As you walk the array of values, set the corresponding bit to true. Then you can walk over the bit set and output the corresponding value whenever you find a bit set to true.
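Something like the following is a minimal sketch of that idea using java.util.BitSet. The class name BitSetDedup, the method names markSeen and uniqueSorted, and the constant MAX_VALUE are made up for illustration; the upper bound of 9999999 is taken from the question, and the code assumes no value is negative.

```java
import java.util.Arrays;
import java.util.BitSet;

final class BitSetDedup {
    // Assumption: largest possible value, as stated in the question.
    static final int MAX_VALUE = 9_999_999;

    // First pass: set one bit per value seen. Duplicates just set the same bit again.
    // Note: BitSet.set throws IndexOutOfBoundsException for a negative value.
    static BitSet markSeen(int[] values) {
        BitSet seen = new BitSet(MAX_VALUE + 1);
        for (int v : values) {
            seen.set(v);
        }
        return seen;
    }

    // Second pass: collect every set bit, which yields the unique values in ascending order.
    static int[] uniqueSorted(int[] values) {
        BitSet seen = markSeen(values);
        int[] result = new int[seen.cardinality()];
        int i = 0;
        for (int v = seen.nextSetBit(0); v >= 0; v = seen.nextSetBit(v + 1)) {
            result[i++] = v;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] input = {500000, 100000, 9999999, 500000, 100000, 123456};
        // Number of distinct values only, without building an output array.
        System.out.println(markSeen(input).cardinality()); // prints 4
        System.out.println(Arrays.toString(uniqueSorted(input)));
        // prints [100000, 123456, 500000, 9999999]
    }
}
```

If all you need to store is the number of unique values, the cardinality() call on its own is enough and the second pass can be skipped.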
About 1.2 MB will fit in the L2 or L3 cache of many current CPUs, so this could be quite efficient depending on the bit set implementation.
This also has the side effect of sorting the data.
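And this is an O(n) algorithm: it makes a single pass over the n input values, and each set operation is O(1) (for an array-based set like this). The output pass scans the whole bit set, so its cost is proportional to the fixed range of possible values (10000000 bits here) rather than to n. It emits m values, where m is the number of unique values and, by definition, must be <= n.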