I have a data structure of key-value pairs and I want to implement a “GROUP BY” on the values.
Both keys and values are strings.
So what I did was give every value (string) a unique prime number. Then, for every key, I stored the product of the prime numbers associated with the different values that key has.
So if key “Anirudh” has values “x”, “y”, “z”, then I also store the number M(key) = 2*3*5 = 30.
Later, if I want to group by a particular value (say “x”), I just iterate over all the keys and divide M(key) by the prime number associated with “x”. If the remainder is 0, then that key belongs to the group for value “x”.
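Roughly, what I am doing looks like this minimal Python sketch. The sample data, the names `prime_of`, `M` and `group_by`, and the short hard-coded prime list are only for illustration:

```python
# Hypothetical sample data: each key maps to the set of values it has.
data = {
    "Anirudh": {"x", "y", "z"},
    "Bela": {"y"},
    "Chen": {"x", "z"},
}

# Give every distinct value its own prime.
# (Hard-coded list: only enough primes for this toy example.)
primes = [2, 3, 5, 7, 11, 13]
distinct_values = sorted({v for values in data.values() for v in values})
prime_of = {value: primes[i] for i, value in enumerate(distinct_values)}

# M(key) is the product of the primes of that key's values.
M = {}
for key, values in data.items():
    product = 1
    for value in values:
        product *= prime_of[value]
    M[key] = product


def group_by(value):
    """Scan every key and keep those whose product is divisible by the value's prime."""
    p = prime_of[value]
    return sorted(key for key, m in M.items() if m % p == 0)


print(M["Anirudh"])   # 30
print(group_by("x"))  # ['Anirudh', 'Chen']
```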
I know that this is a very weird way to do it. Some people sort the key-value pairs (by value). I could also have created another table (a hash table) already grouped by value. So I want to know a better method than mine (there must be many). In my method, as the number of unique values for a particular key increases, the product of the prime numbers also increases (at least exponentially).
Your method always takes O(n) time to find the members of a group, because you have to iterate over every key in the collection and test it. It also risks overflowing common integer sizes (32 or 64 bits) if a key has many values, since you are multiplying a potentially large number of primes together to form M(key).
You will find it more efficient and certainly more predictable to use a bit mask to track group membership: give each group one bit position and set that bit for every group a key belongs to. If you have 16 groups, you can represent that with a 16-bit short. Using primes as you suggest, you would need an integer with enough bits to hold the number 32589158477190044730 (the first 16 primes multiplied together), which would require 65 bits.
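As a rough illustration of the bit mask idea, here is a minimal Python sketch. The group names, the sample keys and the helpers `make_mask` and `in_group` are made up for the example. Note that Python integers never overflow, so the last lines only compare how many bits each representation would need in a fixed-width language:

```python
# Hypothetical groups: each one gets its own bit position.
groups = ["x", "y", "z"]
bit_of = {name: 1 << i for i, name in enumerate(groups)}


def make_mask(values):
    """OR together the bits of every group the key belongs to."""
    mask = 0
    for value in values:
        mask |= bit_of[value]
    return mask


masks = {
    "Anirudh": make_mask(["x", "y", "z"]),
    "Bela": make_mask(["y"]),
}


def in_group(key, value):
    """A key is in the group if the group's bit is set in its mask."""
    return (masks[key] & bit_of[value]) != 0


print(in_group("Anirudh", "x"))  # True
print(in_group("Bela", "x"))     # False

# Size comparison for 16 groups: prime product vs. bit mask.
first_16_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]
product = 1
for p in first_16_primes:
    product *= p

print(product.bit_length())            # 65
print(((1 << 16) - 1).bit_length())    # 16
```

The membership test is still one check per key, but the storage cost is one bit per group instead of one prime factor per group.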
Other approaches to grouping are also O(n) for the first pass (after all, each element must be tested at least once for group membership). However, if you tend to repeat the same group queries, the other methods you refer to (e.g. keeping a list or hash table per target group) are much more efficient, because subsequent lookups of a group are O(1).
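A hash table per target group could be sketched like this in Python. The sample pairs and the names `keys_by_value` and `group_by` are assumptions for the example, not part of your code:

```python
from collections import defaultdict

# Hypothetical input: a flat list of (key, value) pairs.
pairs = [
    ("Anirudh", "x"),
    ("Anirudh", "y"),
    ("Anirudh", "z"),
    ("Bela", "y"),
    ("Chen", "x"),
]

# One O(n) pass builds the index: value -> set of keys that have it.
keys_by_value = defaultdict(set)
for key, value in pairs:
    keys_by_value[value].add(key)


def group_by(value):
    """Average O(1) hash lookup; returns an empty set for an unknown value."""
    return keys_by_value.get(value, set())


print(sorted(group_by("x")))  # ['Anirudh', 'Chen']
print(sorted(group_by("q")))  # []
```

The trade-off is extra memory for the index and the need to update it whenever a pair is added or removed.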
So to directly answer your question:
Given that repeat queries seem likely based on your question: