I am building a dictionary over a very long string (~1G) in which each key is a fixed-length k-mer and each value is the list of all positions where that k-mer occurs. When k is large (>9) it makes no sense to pre-build the dictionary with every possible k-mer, since not all of them will occur and the unused keys only inflate the table.
Currently I’m doing the task like this:
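def hash_string(st, mersize):
    # Map each k-mer of length mersize to the list of its start positions.
    stsize = len(st)
    positions = {}  # renamed from "hash" to avoid shadowing the builtin
    r = stsize - mersize + 1
    for i in range(0, r):
        mer = st[i:i + mersize]
        if mer in positions:
            positions[mer].append(i)
        else:
            positions[mer] = [i]
    return positions

# test for function hash_string above
mer3 = hash_string("ABCDABBBBBAAACCCCABCDDDD", 3)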
The most time-consuming step (according to cProfile) is checking whether each key encountered as we move along the string is new or already exists. What is the fastest way to do this?
(I am currently testing a two-pass strategy that avoids this step and is much faster for large sequences. On the first pass I build the list of keys, simply overwriting duplicates. I then seed my dict with these keys, so on the second pass I no longer have to check for key existence and simply append positions as I encounter them.)
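For concreteness, the two-pass idea might look roughly like the sketch below. The function name hash_string_two_pass is made up for illustration, and using a set comprehension to collapse duplicate keys is an assumption about how the first pass is done, not necessarily the exact code being tested.

```python
def hash_string_two_pass(st, mersize):
    """Two-pass k-mer index (hypothetical sketch of the strategy described above)."""
    r = len(st) - mersize + 1
    # Pass 1: collect every distinct k-mer; duplicates collapse in the set.
    keys = {st[i:i + mersize] for i in range(r)}
    # Seed the dict with a fresh list per key. Note that
    # dict.fromkeys(keys, []) would make all keys share one list.
    positions = {mer: [] for mer in keys}
    # Pass 2: every key already exists, so no membership test is needed.
    for i in range(r):
        positions[st[i:i + mersize]].append(i)
    return positions


mer3_two_pass = hash_string_two_pass("ABCDABBBBBAAACCCCABCDDDD", 3)
```

The trade-off is that every k-mer is sliced and hashed twice, so whether this beats a single pass presumably depends on the data and should be measured.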
But to sum up, I’d still be interested in knowing the fastest way to look up a dict key in Python, since this is a common pattern:
if the key exists, append the new entry; otherwise, create the key and add the first element.
What’s the fastest implementation of this pattern?
I would use collections.defaultdict, though I have never profiled it against if ... else.
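A minimal sketch of what the defaultdict version could look like, with a small timeit harness to compare it against the explicit membership test. The function names, the sample string, and the repeat count are all made up for illustration, and no timings are claimed here; they need to be measured on real data.

```python
import timeit
from collections import defaultdict


def hash_string_defaultdict(st, mersize):
    """k-mer index using defaultdict(list); missing keys get a new list automatically."""
    positions = defaultdict(list)
    for i in range(len(st) - mersize + 1):
        positions[st[i:i + mersize]].append(i)
    return positions


def hash_string_if_else(st, mersize):
    """Same index built with an explicit membership test, for comparison."""
    positions = {}
    for i in range(len(st) - mersize + 1):
        mer = st[i:i + mersize]
        if mer in positions:
            positions[mer].append(i)
        else:
            positions[mer] = [i]
    return positions


mer3_dd = hash_string_defaultdict("ABCDABBBBBAAACCCCABCDDDD", 3)

# Hypothetical micro-benchmark on a small synthetic input.
sample = "ACGTTGCAAC" * 1000
t_dd = timeit.timeit(lambda: hash_string_defaultdict(sample, 12), number=5)
t_if = timeit.timeit(lambda: hash_string_if_else(sample, 12), number=5)
print(f"defaultdict: {t_dd:.4f}s  if/else: {t_if:.4f}s")
```

One thing to keep in mind: indexing a defaultdict with a key that is absent (for example mer3_dd["ZZZ"]) inserts that key with an empty list, so use `in` or .get() for read-only lookups afterwards.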