I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag).

Question

0

Asked: May 25, 20262026-05-25T15:54:30+00:00 2026-05-25T15:54:30+00:00

I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag).

0

I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag). It may not contain null elements, although that’s not crucial to my question. The equals contract defined by the interface is as you would expect:

obj instanceof MyInterface
obj contains the same elements as this (by equals)
obj contains the same number of duplicates for each element
order of elements is disregarded

Now I want to write my hashCode method. My initial idea was:

int hashCode = 1;
for( Object o : this ) {
    hashCode += o.hashCode();
}

However, I noticed that com.google.common.collect.Multiset (from Guava) defines the hash code as follows:

int hashCode = 0;
for( Object o : elementSet() ) {
    hashCode += ((o == null) ? 0 : o.hashCode()) ^ count(o);
}

It strikes me as odd that an empty Multiset would have hash code 0, but more importantly I don’t understand the benefit of ^ count(o) over simply adding up the hash codes of every duplicate. Maybe it’s about not calculating the same hash code more than once, but then why not * count(o)?

My question: what would be an efficient hash code calculation? In my case the count for an element is not guaranteed to be cheap to obtain.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-25T15:54:30+00:00

Update

Let’s say, for example, in the case where we’d have an array that we want to treat as a multiset.

So you have to process all the entries as they come, you can’t use count, and can’t assume the entries come in a known order.

The general function I’d consider is

int hashCode() {
    int x = INITIAL_VALUE;
    for (Object o : this) {
        x = f(x, o==null ? NULL_HASH : g(o.hashCode()));
    }
    return h(x);
}

Some observations:

As already stated in the other answers, the INITIAL_VALUE doesn’t matter much.
I wouldn’t go for NULL_HASH=0 since this would ignore null values.
The function g can be used in case you expect the hashes of the members to be in a small range (which can happen in case they are e.g., single characters).
The function h can be used to improve the result, which is not very important since this already happens e.g. in HashMap.hash(int).
The function f is the most important one, unfortunately, it’s quite limited as it obviously must be both associative and commutative.
The function f should be bijective in both arguments, otherwise you’d generate unnecessary collisions.

In no case I’d recommend f(x, y) = x^y since it’d make two occurrences of an element to cancel out. Using addition is better. Something like

f(x, y) = x + (2*A*x + 1) * y

where A is a constant satisfies all the above conditions. It may be worth it.
For A=0 it degenerates to addition, using an even A is not good as it shift bits of x*y out.
Using A=1 is fine, and the expression 2*x+1 can be computed using a single instruction on the x86 architecture.
Using a larger odd A might work better in case the hashes of the members are badly distributed.

In case you go for a non-trivial hashCode() you should test if it works correctly. You should measure the performance of your program, maybe you’ll find simple addition sufficient. Otherwise, I’d for for NULL_HASH=1, g=h=identity, and A=1.

My old answer

It may be for efficiency reasons. Calling count may be expensive for some implementations, but entrySet may be used instead. Still it might be more costly, I can’t tell.

I did a simple collision benchmark for Guava’s hashCode and Rinke’s and my own proposals:

enum HashCodeMethod {
    GUAVA {
        @Override
        public int hashCode(Multiset<?> multiset) {
            return multiset.hashCode();
        }
    },
    RINKE {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Object o : multiset.elementSet()) {
                result += (o==null ? 0 : o.hashCode()) * multiset.count(o);
            }
            return result;
        }
    },
    MAAARTIN {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Multiset.Entry<?> e : multiset.entrySet()) {
                result += (e.getElement()==null ? 0 : e.getElement().hashCode()) * (2*e.getCount()+123);
            }
            return result;
        }
    }
    ;
    public abstract int hashCode(Multiset<?> multiset);
}

The collision counting code went as follows:

private void countCollisions() throws Exception {
    final String letters1 = "abcdefgh";
    final String letters2 = "ABCDEFGH";
    final int total = letters1.length() * letters2.length();
    for (final HashCodeMethod hcm : HashCodeMethod.values()) {
        final Multiset<Integer> histogram = HashMultiset.create();
        for (final String s1 : Splitter.fixedLength(1).split(letters1)) {
            for (final String s2 : Splitter.fixedLength(1).split(letters2)) {
                histogram.add(hcm.hashCode(ImmutableMultiset.of(s1, s2, s2)));
            }
        }
        System.out.println("Collisions " + hcm + ": " + (total-histogram.elementSet().size()));
    }
}

and printed

Collisions GUAVA: 45
Collisions RINKE: 42
Collisions MAAARTIN: 0

So in this simple example Guava’s hashCode performed really bad (45 collisions out of 63 possible). However, I don’t claim my example is of much relevance for real life.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag).

Leave an answerCancel reply

1 Answer

Update

My old answer

Leave an answer
Cancel reply