I have a loop that calculates the similarity between two documents. It collects all the tokens in a document and their scores and places them in a dictionary. It then compares the dictionaries.
This is what I have so far. It works, but it is super slow:
# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
#convert tuple to a dictionary
doca_dic = dict((row[0], row[1]) for row in doca)
#Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
#convert tuple to a dictionary
docb_dic = dict((row[0], row[1]) for row in docb)
# loop through each token in doca and see if one matches in docb
for x in doca_dic:
    if docb_dic.has_key(x):
        #calculate the similarity by summing the products of the tf-idf_norm
        similarity += doca_dic[x] * docb_dic[x]
print "similarity"
print similarity
I’m pretty new to Python, hence this mess. I need to speed it up, any help would be appreciated.
Thanks.
A Python point:
adict.has_key(k) is obsolete in Python 2.X and vanished in Python 3.X. k in adict as an expression has been available since Python 2.2; use it instead. It will be faster (no method call).
An any-language practical point: iterate over the shorter dictionary.
Combined result:
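A minimal sketch of the two points combined, assuming both dictionaries map a token to its tf-idf score as in the question. The function name dot_similarity and the sample values are made up for illustration.

```python
def dot_similarity(doca_dic, docb_dic):
    """Sum the products of the scores of tokens present in both dictionaries."""
    # Iterate over the shorter dictionary so that fewer lookups are needed.
    if len(doca_dic) < len(docb_dic):
        short_dict, long_dict = doca_dic, docb_dic
    else:
        short_dict, long_dict = docb_dic, doca_dic
    similarity = 0.0
    for token in short_dict:
        if token in long_dict:  # replaces long_dict.has_key(token)
            similarity += short_dict[token] * long_dict[token]
    return similarity


# Hypothetical sample data, not taken from the question.
doca_dic = {"cat": 0.5, "dog": 0.5, "fish": 0.25}
docb_dic = {"cat": 0.5, "bird": 0.75}
similarity = dot_similarity(doca_dic, docb_dic)  # only "cat" is shared
print(similarity)
```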
And if you don’t need the two dictionaries for anything else, you could create only the A one and iterate over the B (key, value) tuples as they pop out of your B query. After the docb = cursor2.fetchall(), replace all following code by this:
Alternative to the above code: This is doing more work but it’s doing more of the iterating in C instead of Python and may be faster.
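The two variants just described could look roughly like this. It is a sketch only: the function names are hypothetical, and the lists doca and docb stand in for what cursor1.fetchall() and cursor2.fetchall() would return, assuming each row is a (token, tfidf_norm) pair and a token appears at most once per document.

```python
def similarity_from_rows(doca_dic, docb_rows):
    """Variant 1: iterate over B's (token, score) rows; no dictionary for B."""
    similarity = 0.0
    for b_token, b_value in docb_rows:
        if b_token in doca_dic:
            similarity += doca_dic[b_token] * b_value
    return similarity


def similarity_from_sets(doca_dic, docb_dic):
    """Variant 2: let a set intersection find the shared tokens."""
    shared = set(doca_dic) & set(docb_dic)
    return sum(doca_dic[k] * docb_dic[k] for k in shared)


# Hypothetical rows standing in for the fetchall() results.
doca = [("cat", 0.5), ("dog", 0.5)]
docb = [("cat", 0.5), ("bird", 0.75), ("dog", 0.25)]
doca_dic = dict(doca)

print(similarity_from_rows(doca_dic, docb))
print(similarity_from_sets(doca_dic, dict(docb)))
```

Which of the two is faster depends on the data, so it is worth timing both on real documents.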
Final version of the Python code
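A sketch of how the pieces above might fit together, assuming doca and docb are the row lists returned by the two fetchall() calls in the question. The function name similarity_from_query_rows is hypothetical; only one dictionary is built, and the loop runs over the shorter result.

```python
def similarity_from_query_rows(doca, docb):
    """doca and docb are sequences of (token, tfidf_norm) rows."""
    if len(doca) < len(docb):
        short_doc, long_doc = doca, docb
    else:
        short_doc, long_doc = docb, doca
    long_dict = dict(long_doc)  # a sequence of 2-tuples converts directly
    similarity = 0.0
    for token, value in short_doc:
        if token in long_dict:
            similarity += long_dict[token] * value
    return similarity


# Hypothetical rows; in the question's script these come from the cursors.
doca = [("cat", 0.5), ("dog", 0.5), ("fish", 0.25)]
docb = [("cat", 0.5), ("dog", 0.25)]
print(similarity_from_query_rows(doca, docb))
```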
Another practical point: you haven’t said which part of it is slow … working on the dicts or doing the selects? Put some calls of time.time() into your script.
Consider pushing ALL the work onto the database. The following example uses a hardwired SQLite query but the principle is the same.
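A small sketch of both suggestions, using Python's sqlite3 module with an in-memory database. The table name atable, its columns (docid, token, score) and the sample rows are assumptions for illustration; the question's real table and column names differ. The similarity is computed by one self-join in SQL, and time.time() calls bracket the query so you can see how long it takes.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical schema standing in for the question's table.
conn.execute(
    "CREATE TABLE atable (docid INTEGER, token TEXT, score REAL, "
    "PRIMARY KEY (docid, token))"
)
conn.executemany(
    "INSERT INTO atable (docid, token, score) VALUES (?, ?, ?)",
    [(1, "cat", 0.5), (1, "dog", 0.5), (2, "cat", 0.5), (2, "bird", 0.75)],
)


def db_similarity(conn, docid_a, docid_b):
    """Let the database sum the score products over the shared tokens."""
    row = conn.execute(
        "SELECT SUM(a.score * b.score) FROM atable AS a "
        "JOIN atable AS b ON a.token = b.token "
        "WHERE a.docid = ? AND b.docid = ?",
        (docid_a, docid_b),
    ).fetchone()
    return row[0] or 0.0  # SUM() yields NULL when no token is shared


t0 = time.time()
similarity = db_similarity(conn, 1, 2)
elapsed = time.time() - t0
print("similarity = %s, query took %.6f s" % (similarity, elapsed))
```

The same pair of time.time() calls can be placed around the selects and around the dictionary loop in the original script to find out which part is slow.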
And it’s worth checking that the database table is appropriately indexed (e.g. one on token by itself) … not having a usable index is a good way of making an SQL query run very slowly.
Explanation: Having an index on token may make either your existing queries or the “do all the work in the DB” query or both run faster, depending on the whims of the query optimiser in your DB software and the phase of the moon. If you don’t have a usable index, the DB will read ALL the rows in your table, which is not good.
Creating an index:
create index atable_token_idx on atable(token);
Dropping an index:
drop index atable_token_idx;
(but do consult the docs for your DB)
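One way to check whether an index is actually used, sketched for SQLite only: EXPLAIN QUERY PLAN is SQLite-specific, other databases have their own equivalents, and the exact wording of the plan varies between SQLite versions. The table layout and sample rows here are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table without any index to start with.
conn.execute("CREATE TABLE atable (docid INTEGER, token TEXT, score REAL)")
conn.executemany(
    "INSERT INTO atable VALUES (?, ?, ?)",
    [(1, "cat", 0.5), (1, "dog", 0.5), (2, "cat", 0.5)],
)


def query_plan(conn):
    """Return SQLite's plan description for a lookup by token."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT docid, score FROM atable WHERE token = ?",
        ("cat",),
    ).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " ".join(str(row[-1]) for row in rows)


plan_before = query_plan(conn)  # full table scan expected
conn.execute("CREATE INDEX atable_token_idx ON atable(token)")
plan_after = query_plan(conn)   # should now mention the index
print(plan_before)
print(plan_after)
```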