I am trying to optimize my code because it becomes really slow when I load huge dictionaries. I think it's because it searches for a key in the dictionary. I've been reading about Python's defaultdict and I think it might be a good improvement, but I have failed to implement it here. As you can see, it is a hierarchical dictionary structure. Any hint will be appreciated.
class Species:
    '''This structure contains all the information needed for all genes.
    One species has several genes, one gene has several proteins'''
    def __init__(self, name):
        self.name = name #name of the GENE
        self.genes = {}
    def addProtein(self, gene, protname, len):
        #Converting a line from the input file into a protein and/or an exon
        if gene in self.genes:
            #Gene in the structure
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
        else:
            self.genes[gene] = Gene(gene)
            self.updateNgenes()
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
    def updateNgenes(self):
        #Updating the number of genes
        self.ngenes = len(self.genes.keys())
The definitions of Gene and Protein are:
class Protein:
    #The class Protein contains information about the length of the protein and a list with its exons (with their own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len
class Gene:
    #The class Gene contains information about the gene and a dict with its proteins (with their own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}
        self.updateProts()
    def updateProts(self):
        #Update number of proteins
        self.nproteins = len(self.proteins)
You cannot use a defaultdict directly, because its default factory is called with no arguments and your __init__ methods require arguments.

This is probably one of your bottlenecks:

In Python 2, len(self.genes.keys()) creates a list of all keys before calculating the length. This means that every time you add a gene, you create a list and throw it away. This list creation gets more and more expensive the more genes you have. To avoid creating an intermediate list, just do len(self.genes).

Better yet would be to make ngenes a property, so it is only calculated when you need it. The same can be done with nproteins in the Gene class.

Here is your code refactored:
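As a minimal sketch of the two suggestions above (properties instead of stored counters, and len() on the dict itself), something along these lines should work. It is not necessarily identical to the original refactoring. The parameter name length (instead of len, to avoid shadowing the built-in) and the sample gene and protein names are my own assumptions for illustration.

```python
class Protein:
    """A protein with a name and a length."""

    def __init__(self, name, length):
        self.name = name
        self.length = length  # 'length' is an assumed rename of the original 'len'


class Gene:
    """A gene holding a dict of its proteins, keyed by protein name."""

    def __init__(self, name):
        self.name = name
        self.proteins = {}

    @property
    def nproteins(self):
        # Computed only when asked for; len() on a dict builds no list.
        return len(self.proteins)


class Species:
    """A species holding a dict of its genes, keyed by gene name."""

    def __init__(self, name):
        self.name = name
        self.genes = {}

    @property
    def ngenes(self):
        # No counter to keep in sync after every insertion.
        return len(self.genes)

    def addProtein(self, gene, protname, length):
        # Create the gene on first sight, then add the protein in one place.
        if gene not in self.genes:
            self.genes[gene] = Gene(gene)
        self.genes[gene].proteins[protname] = Protein(protname, length)


# Hypothetical sample data, for illustration only.
species = Species("example")
species.addProtein("geneA", "protA1", 120)
species.addProtein("geneA", "protA2", 95)
species.addProtein("geneB", "protB1", 300)
print(species.ngenes, species.genes["geneA"].nproteins)  # prints: 2 2
```

With properties, adding a protein no longer triggers any counting work; the counts are derived from the dicts only when something actually reads ngenes or nproteins.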