What is the most efficient algorithm for grouping identical items together in an array, given the following:
- Almost all items are duplicated several times.
- The items are not necessarily integers or anything else that’s similarly simple. The range of the keys is not even well-defined, let alone small. In fact, the keys can be arbitrary structs. This rules out the simplest forms of counting sort.
- We care about both asymptotic and non-asymptotic properties, and n may be small sometimes. However, when n is small, performance is still important because this function may be called several million times in a loop on millions of small datasets. This rules out any expensive hash function or using a complex data structure that needs to perform lots of memory allocations.
- The data may be sorted in arbitrary order as long as all identical items are grouped together.
If this is confusing, here’s an example, assuming such a function is named groupIdentical:
uint[] foo = [1, 2, 3, 2, 1, 5, 4, 5];
uint[] bar = groupIdentical(foo);
// One possible correct value for bar:
// bar == [2, 2, 1, 1, 3, 4, 5, 5].
// Another possible correct answer:
// bar == [1, 1, 2, 2, 5, 5, 4, 3].
However, as a reminder, we cannot assume that the data is composed of integers.
Edit: Thank you for the answers. My main problem with hashing was that hash tables perform memory allocations too frequently. What I ended up doing was writing my own hash table that uses a region allocator I already had on hand, which gets around this problem. Works well.
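A minimal sketch of the allocation-light approach described above, assuming keys that reduce to a 32-bit value for illustration: an open-addressing table whose storage is one flat buffer (a stand-in for region-allocated memory), so the hot loop does no per-element allocation. The hash constant and table layout are my own choices, not from the poster's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot per distinct key; the whole table is a single contiguous buffer.
struct Slot {
    uint32_t key = 0;
    uint32_t count = 0;
    bool used = false;
};

std::vector<uint32_t> groupIdentical(const std::vector<uint32_t>& in) {
    // Power-of-two capacity, load factor kept below 0.5 for cheap probing.
    size_t cap = 1;
    while (cap < in.size() * 2) cap <<= 1;
    std::vector<Slot> table(cap);  // one allocation; a real region allocator would reuse this

    for (uint32_t k : in) {
        size_t i = (k * 2654435761u) & (cap - 1);  // Knuth multiplicative hash
        while (table[i].used && table[i].key != k)
            i = (i + 1) & (cap - 1);               // linear probing
        table[i].used = true;
        table[i].key = k;
        table[i].count++;
    }

    // Emit each distinct key count times; groups come out contiguous.
    std::vector<uint32_t> out;
    out.reserve(in.size());
    for (const Slot& s : table)
        for (uint32_t c = 0; c < s.count; ++c)
            out.push_back(s.key);
    return out;
}
```

For arbitrary structs you would store the struct (or an index into the input) in each slot and supply your own hash and equality, but the shape of the loop is the same.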
I think you could just hash the objects, since the actual order doesn’t matter, only the grouping. Identical objects will end up grouped in the same bucket. This assumes that every type you’re interested in has its own hash function, or that you can define your own and overload it (taking each type as a parameter to a different hashCode function definition).
To avoid collisions across data types (so strings don’t end up in the same bucket as doubles, for one example), you’d need to encode the data type into the hash. So, for example, if you have a 32-bit hash, maybe the first 5 bits could encode the data type, so you can have 32 different types in the same hash map.
EDIT: Let me just add that the reason I’m suggesting a custom hash map is that I don’t know of one that exposes enough of its internal implementation for you to get the values out of each bucket. There might be such an implementation that I don’t know of. There are a lot of things I don’t know. 🙂