I have a large number of objects I need to store in memory for processing in Python. Specifically, I’m trying to remove duplicates from a large set of objects. I want to consider two objects “equal” if a certain instance variable in the object is equal. So, I assumed the easiest way to do this would be to insert all my objects into a set, and override the __hash__ method so that it hashes the instance variable I’m concerned with.
So, as a test I tried the following:
class Person:
def __init__(self, n, a):
self.name = n
self.age = a
def __hash__(self):
return hash(self.name)
def __str__(self):
return "{0}:{1}".format(self.name, self.age)
myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000)) # try adding a duplicate
for p in myset: print(p)
Here, I define a Person class, and any two instances of Person with the same name variable are to be equal, regardless of the value of any other instance variable. Unfortunately, this outputs:
baz:30
foo:10
bar:20
foo:1000
Note that foo appears twice, so this program failed to notice duplicates. Yet the expression hash(Person("foo", 10)) == hash(Person("foo", 1000)) is True. So why doesn’t this properly detect duplicate Person objects?
You forgot to also define
__eq__().