I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag). It may not contain null elements, although that’s not crucial to my question. The equals contract defined by the interface is as you would expect:
- obj instanceof MyInterface
- obj contains the same elements as this (by equals)
- obj contains the same number of duplicates for each element
- order of elements is disregarded
Now I want to write my hashCode method. My initial idea was:
int hashCode = 1;
for( Object o : this ) {
hashCode += o.hashCode();
}
However, I noticed that com.google.common.collect.Multiset (from Guava) defines the hash code as follows:
int hashCode = 0;
for( Object o : elementSet() ) {
hashCode += ((o == null) ? 0 : o.hashCode()) ^ count(o);
}
It strikes me as odd that an empty Multiset would have hash code 0, but more importantly I don’t understand the benefit of ^ count(o) over simply adding up the hash codes of every duplicate. Maybe it’s about not calculating the same hash code more than once, but then why not * count(o)?
My question: what would be an efficient hash code calculation? In my case the count for an element is not guaranteed to be cheap to obtain.
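To make the comparison concrete, here is a small sketch. The class name HashVariants and the map-of-counts representation are assumptions for illustration only; they are not part of Guava or of my interface. It suggests that adding the hash code of every duplicate gives the same value as multiplying by the count, while the ^ count(o) variant is a different function.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HashVariants {
    // My initial idea: start at 1 and add the hash code of every occurrence.
    static int sumOfAll(Iterable<?> elements) {
        int hashCode = 1;
        for (Object o : elements) {
            hashCode += o.hashCode();
        }
        return hashCode;
    }

    // Same start value, but each distinct element is visited once and
    // its hash code is multiplied by its count.
    static int timesCount(Map<?, Integer> counts) {
        int hashCode = 1;
        for (Map.Entry<?, Integer> e : counts.entrySet()) {
            hashCode += e.getKey().hashCode() * e.getValue();
        }
        return hashCode;
    }

    // Guava-style definition: sum of (hash ^ count) over distinct elements.
    // Null handling is omitted because this sketch assumes non-null elements.
    static int xorCount(Map<?, Integer> counts) {
        int hashCode = 0;
        for (Map.Entry<?, Integer> e : counts.entrySet()) {
            hashCode += e.getKey().hashCode() ^ e.getValue();
        }
        return hashCode;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("a", "a", "b");
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a", 2);
        counts.put("b", 1);

        System.out.println(sumOfAll(elements) == timesCount(counts)); // true
        System.out.println(xorCount(counts));
    }
}
```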
Update
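So you have to process all the entries as they come, you can’t use count, and you can’t assume the entries come in a known order. The general function I’d consider starts from some initial value, combines it with each member in turn using a function f, where a member o contributes g(o.hashCode()) (or NULL_HASH if o is null), and finally applies h to the result.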
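It can be sketched roughly as follows. The class name MultisetHash, INITIAL_VALUE = 0 and the concrete form of f used here, x + y + 2*A*x*y, are assumptions for illustration. The choices NULL_HASH=1, g=h=identity and A=1 are the ones suggested further down.

```java
import java.util.Arrays;
import java.util.List;

public class MultisetHash {
    static final int INITIAL_VALUE = 0; // assumption: 0 is the neutral element of f below
    static final int NULL_HASH = 1;     // contribution of a null member
    static final int A = 1;             // constant used by f

    // g pre-processes each member hash; identity in this sketch.
    static int g(int memberHash) {
        return memberHash;
    }

    // h post-processes the combined value; identity in this sketch.
    static int h(int combined) {
        return combined;
    }

    // f combines two values. It is commutative and associative in int
    // arithmetic, so the iteration order does not affect the result.
    static int f(int x, int y) {
        return x + y + 2 * A * x * y;
    }

    // Processes the members one by one, without counts and without
    // assuming any particular order.
    static int hash(Iterable<?> elements) {
        int x = INITIAL_VALUE;
        for (Object o : elements) {
            x = f(x, o == null ? NULL_HASH : g(o.hashCode()));
        }
        return h(x);
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("x", "y", "y");
        List<String> second = Arrays.asList("y", "x", "y");
        System.out.println(hash(first) == hash(second)); // true
    }
}
```

Since every member is folded in individually, duplicates need no special treatment and no call to count is required.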
Some observations:
- Avoid NULL_HASH=0, since this would ignore null values.
- g can be used in case you expect the hashes of the members to be in a small range (which can happen in case they are, e.g., single characters).
- h can be used to improve the result, which is not very important since this already happens, e.g., in HashMap.hash(int).
- f is the most important one. Unfortunately, it’s quite limited, as it obviously must be both associative and commutative.
- f should be bijective in both arguments, otherwise you’d generate unnecessary collisions.

In no case would I recommend f(x, y) = x^y, since it would make two occurrences of an element cancel out. Using addition is better. Something like

f(x, y) = x + y + 2*A*x*y

where A is a constant satisfies all the above conditions. It may be worth it.

- For A=0 it degenerates to addition. Using an even A is not good, as it shifts bits of x*y out.
- Using A=1 is fine, and the expression 2*x+1 can be computed using a single instruction on the x86 architecture.
- Using a larger odd A might work better in case the hashes of the members are badly distributed.

In case you go for a non-trivial hashCode(), you should test that it works correctly. You should also measure the performance of your program; maybe you’ll find simple addition sufficient. Otherwise, I’d go for NULL_HASH=1, g=h=identity, and A=1.

My old answer
It may be for efficiency reasons. Calling count may be expensive for some implementations, but entrySet may be used instead. Still, it might be more costly, I can’t tell.

I did a simple collision benchmark for Guava’s hashCode and Rinke’s and my own proposals:
The collision counting code went as follows:
and printed
So in this simple example Guava’s hashCode performed really badly (45 collisions out of 63 possible). However, I don’t claim my example is of much relevance for real life.
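A collision count of this kind could be sketched as follows. This is not the benchmark used above: the class name CollisionCount, the members 1, 2, 3 and the maximum count of 3 are hypothetical, so the numbers it prints need not match the ones quoted. Each multiset is represented by an array of counts, and a collision is counted whenever a hash value has already been seen.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.ToIntFunction;

public class CollisionCount {
    static final int[] ELEMENTS = {1, 2, 3}; // hypothetical members
    static final int MAX_COUNT = 3;          // each member occurs 0..3 times

    // Start at 1 and add each member's hash times its count, which equals
    // adding the hash code once per occurrence.
    static int sumHash(int[] counts) {
        int hashCode = 1;
        for (int i = 0; i < ELEMENTS.length; i++) {
            hashCode += Integer.hashCode(ELEMENTS[i]) * counts[i];
        }
        return hashCode;
    }

    // Guava-style: sum of (hash ^ count) over the members that are present.
    static int xorCountHash(int[] counts) {
        int hashCode = 0;
        for (int i = 0; i < ELEMENTS.length; i++) {
            if (counts[i] > 0) {
                hashCode += Integer.hashCode(ELEMENTS[i]) ^ counts[i];
            }
        }
        return hashCode;
    }

    // Folds every occurrence with f(x, y) = x + y + 2*x*y (A=1, assumed form).
    static int mixHash(int[] counts) {
        int x = 0;
        for (int i = 0; i < ELEMENTS.length; i++) {
            for (int c = 0; c < counts[i]; c++) {
                int y = Integer.hashCode(ELEMENTS[i]);
                x = x + y + 2 * x * y;
            }
        }
        return x;
    }

    // Enumerates all multisets and returns (number of multisets) minus
    // (number of distinct hash values).
    static int collisions(ToIntFunction<int[]> hash) {
        Set<Integer> seen = new HashSet<>();
        int total = 0;
        for (int a = 0; a <= MAX_COUNT; a++) {
            for (int b = 0; b <= MAX_COUNT; b++) {
                for (int c = 0; c <= MAX_COUNT; c++) {
                    seen.add(hash.applyAsInt(new int[] {a, b, c}));
                    total++;
                }
            }
        }
        return total - seen.size();
    }

    public static void main(String[] args) {
        System.out.println("sum:      " + collisions(CollisionCount::sumHash));
        System.out.println("xorCount: " + collisions(CollisionCount::xorCountHash));
        System.out.println("mix:      " + collisions(CollisionCount::mixHash));
    }
}
```

Such a tiny, regular data set is only good for spotting obviously weak functions; measuring with the real element types is more telling.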