Background
In the PDB module of Biopython, PDB structures are parsed into Structure objects, which store the components of the structure in an SMCRA architecture (Structure/Model/Chain/Residue/Atom). Each level of this hierarchy is represented by an object that inherits from the Entity container class.
Equivalence
My problem is that at no point can any two Entity objects be equal.
Structures built from the same file are not equal:
>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct1 = parser.get_structure("1hgg", "pdb1hgg.ent")
>>> struct2 = parser.get_structure("1hgg", "pdb1hgg.ent")
>>> struct1 == struct2
False
Residues within that structure are not equal:
>>> first_res1 = next(struct1.get_residues())
>>> first_res2 = next(struct2.get_residues())
>>> first_res1 == first_res2
False
And so on.
If we were to parse the same PDB file separately, at no point could any of the Entity objects within the structures be equal.
Solution
The obvious solution to this problem is to never parse the same PDB file twice. Then, we have object identity and thus, equivalence. However, this answer seems incomplete to me.
Each Entity object can return an identification tuple with get_full_id(). This method gives all ids from the top object down; it should be unique for each Entity within a structure, and unique across all structures if the proper PDB id was supplied when constructing the Structure object.
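To illustrate the shape of such a tuple, here is a minimal sketch of a parent-linked hierarchy that builds a top-down id. ToyEntity is a hypothetical stand-in written for this example, not Biopython's implementation, and the ids used (model 0, chain "A", residue 10) are made-up values.

```python
class ToyEntity(object):
    """Hypothetical stand-in for a Bio.PDB entity; not Biopython's code."""

    def __init__(self, id, parent=None):
        self.id = id
        self.parent = parent

    def get_full_id(self):
        # Collect ids from this entity up to the root, then reverse
        # so the tuple reads from the top object down.
        parts = []
        entity = self
        while entity is not None:
            parts.append(entity.id)
            entity = entity.parent
        return tuple(reversed(parts))


# Illustrative ids only: structure id, model id, chain id, residue id.
structure = ToyEntity("1hgg")
model = ToyEntity(0, parent=structure)
chain = ToyEntity("A", parent=model)
residue = ToyEntity((" ", 10, " "), parent=chain)

full_id = residue.get_full_id()
print(full_id)
```

Two separately built hierarchies with the same ids produce equal tuples, which is what the comparison below relies on.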
My solution for testing Entity equivalence is merely to compare this full id. That is:
def __eq__(self, other):
    return self.get_full_id() == other.get_full_id()
Question
At this point, I’m asking if my implementation of Entity equivalence is sensible.
- Are false positives (e.g. differing structures that were supplied the same PDB id) a worry?
- Is it better to simply compare the full ids manually whenever we need to test equivalence?
- And is there any reason why __eq__ was left unimplemented within the PDB module?
One common reason for not defining __eq__ is that it makes things unhashable (so you can’t use them as dictionary keys or put them in sets), unless you also define a consistent __hash__ function and your objects are immutable.
By default, __hash__ for objects is based on the object’s identity (what id() returns), which works even for mutable objects, since the identity never changes. But if you define a custom __eq__, you can’t keep hashing by identity, or you’ll get a situation where two objects can compare as equal but have different hashes, which is inconsistent with how hashing is supposed to work. So you have to define a custom __hash__ function (which you can do), but if your object is mutable, you shouldn’t really do that either, so you’ll just have an unhashable object. Which may be all right for you.
See more info in the Python docs here.
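As a rough illustration of the point above, the sketch below assumes Python 3, where a class that defines __eq__ without __hash__ becomes unhashable. EqOnly and EqAndHash are hypothetical classes made up for this example, and the tuple key stands in for a full id.

```python
class EqOnly:
    """Defines __eq__ only; Python 3 then sets __hash__ to None."""

    def __init__(self, key):
        self.key = key

    def __eq__(self, other):
        if not isinstance(other, EqOnly):
            return NotImplemented
        return self.key == other.key


class EqAndHash:
    """Defines __eq__ and a __hash__ computed from the same key."""

    def __init__(self, key):
        self.key = key

    def __eq__(self, other):
        if not isinstance(other, EqAndHash):
            return NotImplemented
        return self.key == other.key

    def __hash__(self):
        # Equal objects must hash equally, so hash the compared key.
        return hash(self.key)


try:
    hash(EqOnly(("1hgg", 0)))
    eq_only_hashable = True
except TypeError:
    eq_only_hashable = False

a = EqAndHash(("1hgg", 0))
b = EqAndHash(("1hgg", 0))
unique = {a, b}  # a and b collapse into one set member
```

The second class only stays well behaved while key is never changed after the object has been placed in a set or dictionary, which is the mutability caveat described above.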
So you can use a custom __eq__ as long as you don’t need your objects to be hashable, or if they’re immutable; otherwise things get more complicated. Or you could just leave __eq__ alone and name your full ID comparison function something else, so as to not break hashability.
I don’t know enough about what PDB IDs mean (in particular, whether false positives are possible) to tell whether your __eq__ implementation is reasonable from that standpoint.
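The "name it something else" option could look roughly like the sketch below. same_full_id is a hypothetical helper name, and FakeEntity is a stand-in that only mimics get_full_id(); with real Biopython entities you would pass the parsed objects instead.

```python
class FakeEntity(object):
    """Hypothetical stand-in exposing only get_full_id()."""

    def __init__(self, full_id):
        self._full_id = full_id

    def get_full_id(self):
        return self._full_id


def same_full_id(a, b):
    """Compare two entities by full id without touching __eq__."""
    return a.get_full_id() == b.get_full_id()


# Illustrative ids only.
res1 = FakeEntity(("1hgg", 0, "A", (" ", 10, " ")))
res2 = FakeEntity(("1hgg", 0, "A", (" ", 10, " ")))

print(same_full_id(res1, res2))  # full ids match
print(res1 == res2)              # default identity comparison is unchanged
print(len({res1, res2}))         # both objects remain hashable and distinct
```

Because == and hashing keep their default identity semantics, existing code that stores entities in sets or dictionaries is unaffected.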