Although I understand very well what a hash code is and what a hash table does, I have to admit I don't know how to use one (beyond a common dictionary). I want to implement my own hash table, so first I want to understand the very basics of hashing:
- I know I can get the hash code with getHashCode()/hashCode() in Java and Scala. How is this number determined? (Just out of curiosity.)
- If I know the hash code of an object, how can I access it? That is, how can I call that memory bucket?
- Can I change/set a variable's hash code?
Now, I have a very big (about 10^9) list of Int. I am going to access some of them (from none to all) and I need to do it in the fastest way possible. Is a hash table THE BEST way to do it?
PS: I don’t want to discuss it, I just want to know if the hash table is known to be the most efficient. If other good methods exist, maybe you can point me to them.
Thanks,
The hash code is just a number that is guaranteed to be the same for every object that is equal to (“the same” as) the original object.
This means that returning “0” for every hash code call would be valid, but self-defeating. The point is there can (and in most cases will) be duplicates.
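To make that concrete, here is a minimal Java sketch. The class ConstantHashPoint is hypothetical and made up for illustration; it returns 0 from hashCode() for every instance. That satisfies the rule (equal objects share a hash code), and a standard HashSet still gives correct answers, it just has to compare against everything that landed in the same bucket.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical class for illustration: a legal but useless hash code.
class ConstantHashPoint {
    final int x;
    final int y;

    ConstantHashPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ConstantHashPoint)) {
            return false;
        }
        ConstantHashPoint p = (ConstantHashPoint) other;
        return x == p.x && y == p.y;
    }

    // Every instance collides; equal objects still share a hash code,
    // so the contract holds, but lookups degrade toward a linear scan.
    @Override
    public int hashCode() {
        return 0;
    }

    public static void main(String[] args) {
        Set<ConstantHashPoint> points = new HashSet<>();
        points.add(new ConstantHashPoint(1, 2));
        points.add(new ConstantHashPoint(3, 4));
        // equals() is what finally tells the two colliding entries apart.
        System.out.println(points.contains(new ConstantHashPoint(1, 2)));
        System.out.println(points.size());
    }
}
```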
If you know the hash code of an object, you cannot necessarily access it. Per my example above, if all objects returned “0”, you still couldn’t ask which object has hash code 0. However, you could ask for ALL objects with hash code 0 and look through them. This is what a hash table does: it cuts down the amount of iterating by fetching just the objects with the same hash code, then looks through those.
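A rough sketch of that "get the bucket, then look through it" idea, in Java. ToyHashSet and its method names are invented for this example, and real implementations (such as java.util.HashMap) are considerably more involved; this only shows the shape of the lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy hash table of strings, for illustration only.
class ToyHashSet {
    private final List<List<String>> buckets = new ArrayList<>();

    ToyHashSet(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Turn a hash code into a bucket index; floorMod keeps negative
    // hash codes inside the valid index range.
    private int indexFor(String value) {
        return Math.floorMod(value.hashCode(), buckets.size());
    }

    void add(String value) {
        List<String> bucket = buckets.get(indexFor(value));
        if (!bucket.contains(value)) {
            bucket.add(value);
        }
    }

    boolean contains(String value) {
        // Only one bucket is scanned; equals() decides the final match.
        return buckets.get(indexFor(value)).contains(value);
    }

    public static void main(String[] args) {
        ToyHashSet set = new ToyHashSet(16);
        set.add("apple");
        set.add("pear");
        System.out.println(set.contains("apple"));
        System.out.println(set.contains("plum"));
    }
}
```

Note that the hash code never identifies the object by itself. It only narrows the search to one bucket.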
If you were to set (change) a hash code, it would no longer be a hash code, because the value for an object in a given “state” cannot change.
As for the “Best Way” to do it, the fewer unique objects that return the same hash code, the better your hash tables will perform. If you have a long list of “int”, you can just use that int value as your hash code and you will have that rare perfect hash–where each object maps to exactly one hash code.
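In Java you get this for free: the hash code of an Integer is the int value itself. A tiny check (the class name IntHashDemo is just a placeholder for this example):

```java
// Placeholder class name; Integer.hashCode is the standard library method.
class IntHashDemo {
    // Returns true when an int hashes to itself.
    static boolean hashesToItself(int value) {
        return Integer.hashCode(value) == value
            && Integer.valueOf(value).hashCode() == value;
    }

    public static void main(String[] args) {
        System.out.println(hashesToItself(42));
        System.out.println(hashesToItself(-7));
    }
}
```

Distinct ints therefore always have distinct hash codes, although two of them can still end up in the same bucket once the hash code is reduced to a bucket index.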
Note that hashtable isn’t really appropriate for this situation of storing ints. It’s better for situations where you are trying to store complex objects that are not so easy to uniquely identify or compare using other mechanisms.
The problem with your “List of Int” is that if you have the number 5 and you want to look it up in your table, you are just going to find a number 5 there.
Now, if you want to see if the number 5 exists in your table or not, that’s a different matter.
For a set of numbers with few holes you could make a simple boolean array. If a[5] is true, then 5 is in the list. If your set of numbers is very sparse (1, 5, 10002930304) then that wouldn’t be a very good solution, since you’d store “false” in spots 2, 3, 4 and then a whole bunch more before the last number. But it is a direct lookup, a single step that never takes any longer no matter how many numbers you add: O(1).
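The boolean array idea could look roughly like this in Java. The names are made up for the example, and it assumes every value is non-negative and no larger than a maxValue you know up front.

```java
// Illustrative sketch; class and method names are hypothetical.
class BooleanArrayLookup {
    // Assumes every value is in the range 0..maxValue.
    static boolean[] buildTable(int[] values, int maxValue) {
        boolean[] present = new boolean[maxValue + 1];
        for (int value : values) {
            present[value] = true;
        }
        return present;
    }

    // One array access per lookup, regardless of how many values are stored.
    static boolean contains(boolean[] present, int candidate) {
        return candidate >= 0 && candidate < present.length && present[candidate];
    }

    public static void main(String[] args) {
        boolean[] present = buildTable(new int[] {1, 5, 9}, 10);
        System.out.println(contains(present, 5));
        System.out.println(contains(present, 4));
    }
}
```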
You could make this type of storage MUCH denser by packing it into a byte array with one bit per number, but unless you’re pretty good with bit-manipulation, skip it. It involves shifting and masking to pick out individual bits, and it still runs out of memory if your highest number is really big.
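For reference, here is one way the bit-packed version might be written in Java. This is my own sketch of the general technique, with hypothetical names, and it assumes non-negative values no larger than a known maxValue. The standard library's java.util.BitSet does the same job if you would rather not write the shifts yourself.

```java
// Illustrative bit-packed membership table; names are hypothetical.
class PackedBitLookup {
    private final byte[] bits;

    // Assumes every stored value is in the range 0..maxValue.
    PackedBitLookup(int maxValue) {
        bits = new byte[(maxValue >>> 3) + 1]; // 8 values per byte
    }

    void add(int value) {
        // value >>> 3 selects the byte, value & 7 selects the bit in it.
        bits[value >>> 3] |= 1 << (value & 7);
    }

    boolean contains(int value) {
        if (value < 0 || (value >>> 3) >= bits.length) {
            return false;
        }
        return (bits[value >>> 3] & (1 << (value & 7))) != 0;
    }

    public static void main(String[] args) {
        PackedBitLookup lookup = new PackedBitLookup(100);
        lookup.add(5);
        lookup.add(63);
        System.out.println(lookup.contains(5));
        System.out.println(lookup.contains(6));
    }
}
```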
So, for a large sparse list I’d use a sorted integer array instead of a lightly populated boolean array. Once it’s sorted you just do a binary search: start in the middle of the sorted array; if the number you want is higher, jump to the middle of the top half and check that number (if it is lower, use the bottom half), and repeat.
The sorted int array takes a few more steps but not too many more and it doesn’t waste any memory for non-existent numbers.
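A sketch of the sorted-array lookup in Java, with made-up names. The loop is written out to show the halving; in practice java.util.Arrays.binarySearch does the same search for you.

```java
import java.util.Arrays;

// Illustrative sketch; class and method names are hypothetical.
class SortedArrayLookup {
    // Returns a sorted copy so the caller's array is left untouched.
    static int[] prepare(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    // Classic binary search: halve the candidate range on every step.
    static boolean contains(int[] sorted, int target) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            if (sorted[mid] == target) {
                return true;
            } else if (sorted[mid] < target) {
                low = mid + 1;  // target can only be in the upper half
            } else {
                high = mid - 1; // target can only be in the lower half
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] sorted = prepare(new int[] {1000293, 5, 1});
        System.out.println(contains(sorted, 5));
        System.out.println(contains(sorted, 6));
    }
}
```

For about 10^9 entries that works out to roughly 30 comparisons per lookup, since each step halves the remaining range.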