Java Tutorials (Set Implementations):
One thing worth keeping in mind about HashSet is that iteration is linear in the sum of the number of entries and the number of buckets (the capacity).
I find this statement confusing and was wondering if someone could clarify what it means. From what I understand, the best iteration performance is achieved when we have x buckets and exactly 1 item in each bucket.
Let’s substitute x = 200k. This gives us 200k entries and 200k buckets.
Conversely, if all items are in 1 bucket (which, from what I have read, is really horrible), we will have 200k entries and 1 bucket.
Since 200k + 200k > 200k + 1, doesn’t that mean, if we apply the above statement, that iterating with 1 bucket is faster than iterating with 200k buckets?
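Yes. When you iterate over all the elements of a HashSet, having them spread out over many buckets works against you, because the iterator has to visit every bucket, including the empty ones. (That same spread is what makes lookups fast, which is why a single bucket is horrible for everything except iteration.)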
When they say that iteration is linear in the sum of the number of entries and the number of buckets, they mean that iteration takes O(n + m) time, where n is the number of buckets and m is the number of entries. The constant factors are not specified. It could, for instance, be the case that the time it takes is 0.0001 * n + m, i.e., that the impact of the number of buckets is really small compared to the impact of the number of elements.
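To make the n + m count concrete, here is a minimal sketch of a toy chained hash table. It is not how java.util.HashSet is actually implemented; the class name BucketWalk and the method stepsToIterate are made up for illustration, and every bucket visit and every entry visit is assumed to cost exactly one step.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketWalk {
    // Toy model only: a list of buckets, each bucket a list of keys.
    // This is an illustration, NOT java.util.HashSet's real internals.
    static long stepsToIterate(int capacity, int entries) {
        List<List<Integer>> buckets = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) {
            buckets.add(new ArrayList<>());
        }
        // Spread the keys 0..entries-1 over the buckets.
        for (int key = 0; key < entries; key++) {
            buckets.get(key % capacity).add(key);
        }

        long steps = 0;
        for (List<Integer> bucket : buckets) {
            steps++; // one step to look at the bucket, even if it is empty
            for (Integer ignored : bucket) {
                steps++; // one step per stored entry
            }
        }
        return steps; // always capacity + entries in this model
    }

    public static void main(String[] args) {
        System.out.println(stepsToIterate(16, 10));        // 16 buckets, 10 entries
        System.out.println(stepsToIterate(1_000_000, 10)); // same entries, far more buckets
    }
}
```

In this model the step count is always the number of buckets plus the number of entries, no matter how the entries are distributed. That is all the tutorial's statement says; it does not tell you how expensive one bucket visit is compared to one entry visit.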
(BTW, there is another data structure called LinkedHashSet, with similar characteristics to HashSet, but with iteration time proportional only to the number of elements.)
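If you want to see the difference yourself, a rough sketch like the following can help. It gives both sets a deliberately oversized initial capacity and only a handful of elements, then times one pass over each. The class name SparseSetIteration and its helper methods are hypothetical, and this is a crude single-shot timing rather than a proper benchmark (no warm-up, no repetition), so treat the numbers as indicative at best.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class SparseSetIteration {
    // Adds the integers 0..count-1 to the given set and returns it.
    static Set<Integer> fill(Set<Integer> set, int count) {
        for (int i = 0; i < count; i++) {
            set.add(i);
        }
        return set;
    }

    // Times a single full iteration over the set, in nanoseconds.
    static long timeIterationNanos(Set<Integer> set) {
        long start = System.nanoTime();
        long sum = 0;
        for (int value : set) {
            sum += value;
        }
        long elapsed = System.nanoTime() - start;
        if (sum < 0) {
            // Never true here; the check just keeps the loop result in use.
            throw new IllegalStateException("unexpected negative sum");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int capacity = 1 << 20; // many buckets
        int entries = 10;       // very few elements

        Set<Integer> hashSet = fill(new HashSet<>(capacity), entries);
        Set<Integer> linkedSet = fill(new LinkedHashSet<>(capacity), entries);

        System.out.println("HashSet:       " + timeIterationNanos(hashSet) + " ns");
        System.out.println("LinkedHashSet: " + timeIterationNanos(linkedSet) + " ns");
    }
}
```

With a capacity close to the number of elements, which is the usual situation, the two should behave much more alike.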