Here’s my hash function for Strings:
public class GoodHashFunctor implements HashFunctor {
    @Override
    public int hash(String item) {
        String binaryRepString = "";
        for (int i = 0; i < item.length(); i++) {
            // Add the String version of the binary version of the integer version of each character in item
            binaryRepString += Integer.toBinaryString((int) (item.charAt(i)));
        }
        long longVersion = Long.parseLong(binaryRepString, 2) % Integer.MAX_VALUE;
        return (int) longVersion;
    }
}
However, when I try hashing longer Strings (around 10-15 characters), I’m getting errors: parseLong dies because the binary string represents a number too big to fit in a long.
What do you all think I should do? Also, my professor said we can’t use Java’s hashCode().
I saw a similar post where the best answer was to hash this way:
int hash = 7;
for (int i = 0; i < strlen; i++) {
    hash = hash * 31 + charAt(i);
}
But wouldn’t I run into the same problem? I guess it’d probably take a lot longer Strings to break it this new way. I dunno I’m fairly confused…
Why do you need to convert each character into a string (and in binary form at that) before converting it into a long? Why not just have a long value to which you add the char? This is homework, so I’m not posting code. You can also see any good algorithm book (or search the web) for more about hashing.
Edit:
I understand you don’t want to just sum them up, because anagrams will all have the same hash value. But I think you already know how to avoid that. Notice how, by concatenating bits, you are basically adding bits to a value after having shifted it by some number of positions, i.e. "10101" + "10001" is the same as 1010100000 + 10001, which is (21 << 5) + 17. (The parentheses matter in Java: + binds tighter than <<.)
By shifting each character by an amount proportional to its position in the string, the value added to the hash depends on both the value and the position of the character. Also, observe that the same effect can be had by simply multiplying rather than shifting: shifting left by k bits is the same as multiplying by 2^k, and the multiplier does not have to be a power of two.
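To make the arithmetic above concrete, here is a tiny illustration (not a hash function). The class name ShiftDemo and the helper appendBits are made up for this example. It only checks that concatenating the two bit strings, shifting and adding, and multiplying and adding all produce the same number.

```java
class ShiftDemo {
    // Hypothetical helper: makes room for `width` new bits in `acc`,
    // then adds `next` into the freed low bits.
    static long appendBits(long acc, int next, int width) {
        return (acc << width) + next;
    }

    public static void main(String[] args) {
        // The string-based approach from the question, on two 5-bit values.
        long viaConcat = Long.parseLong("10101" + "10001", 2);
        // The same thing with a shift and an add: (21 << 5) + 17.
        long viaShift = appendBits(21, 17, 5);
        // Shifting left by 5 is multiplying by 2^5 = 32.
        long viaMultiply = 21 * 32 + 17;

        System.out.println(viaConcat + " " + viaShift + " " + viaMultiply);
    }
}
```

All three values printed should be identical, which is the point: the String round trip in the question is doing nothing that plain integer arithmetic can't do.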
Another thing to watch out for is the fact that a long has only 64 bits. You can only pack so many chars into it before it starts to overflow. So most practical hash functions take the value modulo some number. Of course, that means there is only a limited number of possible hash values for an unlimited number of input strings. Collisions are inevitable, but well-chosen values for your shift/multiplier and mod can minimize the number of collisions.
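As a rough sketch of the "multiply, add, then take the value modulo some number" idea, and not a finished answer to the assignment: the class name PolyHashSketch and the constants MULTIPLIER and MODULUS below are assumptions picked purely for illustration, so choose and justify your own. Reducing after every character keeps the running value small, so the long never gets near overflow no matter how long the string is.

```java
class PolyHashSketch {
    // Illustrative values only; these are assumptions, not recommendations.
    static final int MULTIPLIER = 31;
    static final int MODULUS = 1_000_003;

    static int hash(String item) {
        long h = 0;
        for (int i = 0; i < item.length(); i++) {
            // h stays below MODULUS, so h * MULTIPLIER + char fits easily in a long.
            h = (h * MULTIPLIER + item.charAt(i)) % MODULUS;
        }
        return (int) h; // always in the range [0, MODULUS)
    }

    public static void main(String[] args) {
        // Anagrams get different values because position affects the result.
        System.out.println(hash("ab") + " vs " + hash("ba"));
        // A string far longer than 15 characters no longer causes an error.
        System.out.println(hash("a considerably longer string than fifteen characters"));
    }
}
```

This also speaks to the worry in the question about the hash*31 loop: plain int arithmetic in Java wraps around silently on overflow instead of throwing, so that loop won't die the way parseLong does, though the result can come out negative and would need handling before being used as a table index.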