Editorial Team
Asked: May 14, 2026

I have a loop that calculates the similarity between two documents. It collects all the tokens in a document and their scores, and places them in a dictionary. It then compares the dictionaries.

This is what I have so far. It works, but it is super slow:

# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
#convert tuple to a dictionary
doca_dic = dict((row[0], row[1]) for row in doca)

#Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
#convert tuple to a dictionary
docb_dic = dict((row[0], row[1]) for row in docb)

# loop through each token in doca and see if one matches in docb
for x in doca_dic:
    if docb_dic.has_key(x):
        #calculate the similarity by summing the products of the tf-idf_norm 
        similarity += doca_dic[x] * docb_dic[x]
print "similarity"
print similarity

I’m pretty new to Python, hence this mess. I need to speed it up, any help would be appreciated.
Thanks.

1 Answer

  1. Editorial Team
    Added an answer on May 14, 2026 at 12:05 am

    A Python point: adict.has_key(k) is obsolete in Python 2.x and was removed in Python 3.x. The expression k in adict has been available since Python 2.2; use it instead. It will also be faster (no method call).
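As a small illustration of the point above, here is a sketch of the `in` membership test written for Python 3. The dictionary contents are made-up values, not data from the question's database.

```python
# Hypothetical token scores for illustration only.
doca_dic = {"apple": 0.5, "word": 0.25}

# Membership test with the `in` operator (Python 2.2+ and Python 3).
has_apple = "apple" in doca_dic
has_zulu = "zulu" in doca_dic
print(has_apple, has_zulu)

# On Python 3 the has_key method is gone entirely.
has_key_exists = hasattr(doca_dic, "has_key")
print(has_key_exists)
```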

    An any-language practical point: iterate over the shorter dictionary.

    Combined result:

    if len(doca_dic) < len(docb_dic):
        short_dict, long_dict = doca_dic, docb_dic
    else:
        short_dict, long_dict = docb_dic, doca_dic
    similarity = 0
    for x in short_dict:
        if x in long_dict:
            #calculate the similarity by summing the products of the tf-idf_norm 
            similarity += short_dict[x] * long_dict[x]
    

    And if you don’t need the two dictionaries for anything else, you could create only the A one and iterate over the B (key, value) tuples as they pop out of your B query. After the docb = cursor2.fetchall(), replace all the code that follows with this:

    similarity = 0
    for b_token, b_value in docb:
        if b_token in doca_dic:
            similarity += doca_dic[b_token] * b_value
    

    Alternative to the above code: this does more work, but it does more of the iterating in C instead of Python and may be faster.

    similarity = sum(
        doca_dic[k] * docb_dic[k]
        for k in set(doca_dic) & set(docb_dic)
        )
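A possible variation on the set-intersection version, assuming Python 3: `dict.keys()` returns a set-like view there, so the keys can be intersected without building two intermediate sets first. The function name and the sample scores below are hypothetical, and whether this is actually faster for your data is something to measure rather than assume.

```python
def similarity_via_key_views(doca_dic, docb_dic):
    """Sum the products of the scores for tokens present in both dictionaries."""
    # In Python 3, dict.keys() is a set-like view, so `&` intersects the
    # keys directly without copying them into two separate sets.
    common = doca_dic.keys() & docb_dic.keys()
    return sum(doca_dic[k] * docb_dic[k] for k in common)


# Hypothetical scores chosen so the arithmetic is exact in floating point.
doca_dic = {"apple": 0.5, "word": 0.25, "zulu": 0.75}
docb_dic = {"apple": 0.5, "word": 0.5, "zealot": 0.125}
result = similarity_via_key_views(doca_dic, docb_dic)
print(result)  # 0.375
```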
    

    Final version of the Python code:

    # Doc A
    # The query parameters must be a sequence, hence the one-element tuple (note the trailing comma).
    cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0],))
    doca = cursor1.fetchall()
    # Doc B
    cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0],))
    docb = cursor2.fetchall()
    if len(doca) < len(docb):
        short_doc, long_doc = doca, docb
    else:
        short_doc, long_doc = docb, doca
    long_dict = dict(long_doc) # yes, it should be that simple
    similarity = 0
    for key, value in short_doc:
        if key in long_dict:
            similarity += long_dict[key] * value
    

    Another practical point: you haven’t said which part of it is slow … working on the dicts or doing the selects? Put some calls of time.time() into your script.
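One way to place those time.time() calls is sketched below in Python 3. The `timed` helper, `fetch_rows` and `compare` are hypothetical stand-ins for your own select and comparison steps; only the pattern of taking a timestamp before and after each stage is the point.

```python
import time


def timed(label, func, *args):
    """Run func(*args), print how long it took, and return (result, seconds)."""
    start = time.time()
    result = func(*args)
    elapsed = time.time() - start
    print("%s took %.6f seconds" % (label, elapsed))
    return result, elapsed


# Hypothetical stand-in for cursor.execute(...) followed by cursor.fetchall().
def fetch_rows():
    return [("token%d" % n, 0.5) for n in range(1000)]


# Hypothetical stand-in for the dictionary comparison step.
def compare(rows_a, rows_b):
    lookup = dict(rows_a)
    return sum(lookup[token] * value for token, value in rows_b if token in lookup)


rows, fetch_seconds = timed("fetch", fetch_rows)
score, compare_seconds = timed("compare", compare, rows, rows)
```

If the fetch stage dominates, tuning the Python loop will not help much and the database side is the place to look.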

    Consider pushing ALL the work onto the database. The following example uses a hardwired SQLite query, but the principle is the same.

    C:\junk\so>sqlite3
    SQLite version 3.6.14
    Enter ".help" for instructions
    Enter SQL statements terminated with a ";"
    sqlite> create table atable(docid text, token text, score float,
        primary key (docid, token));
    sqlite> insert into atable values('a', 'apple', 12.2);
    sqlite> insert into atable values('a', 'word', 29.67);
    sqlite> insert into atable values('a', 'zulu', 78.56);
    sqlite> insert into atable values('b', 'apple', 11.0);
    sqlite> insert into atable values('b', 'word', 33.21);
    sqlite> insert into atable values('b', 'zealot', 11.56);
    sqlite> select sum(A.score * B.score) from atable A, atable B
        where A.token = B.token and A.docid = 'a' and B.docid = 'b';
    1119.5407
    sqlite>
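The same query can be issued from Python with the document ids passed as parameters instead of being hardwired. This is a sketch using the standard-library sqlite3 module and the sample table from the session above; the function name `similarity_in_db` is hypothetical, and other DB-API drivers use a different placeholder style (the question's driver uses %s rather than ?).

```python
import sqlite3


def similarity_in_db(conn, doc_a, doc_b):
    """Let the database join the two documents on token and sum the products."""
    row = conn.execute(
        "select sum(A.score * B.score) from atable A, atable B "
        "where A.token = B.token and A.docid = ? and B.docid = ?",
        (doc_a, doc_b),
    ).fetchone()
    # sum() over zero rows yields NULL, which arrives in Python as None.
    return row[0] if row[0] is not None else 0.0


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table atable(docid text, token text, score float, "
    "primary key (docid, token))"
)
conn.executemany(
    "insert into atable values (?, ?, ?)",
    [
        ("a", "apple", 12.2),
        ("a", "word", 29.67),
        ("a", "zulu", 78.56),
        ("b", "apple", 11.0),
        ("b", "word", 33.21),
        ("b", "zealot", 11.56),
    ],
)
similarity = similarity_in_db(conn, "a", "b")
print(similarity)  # approximately 1119.5407
```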
    

    And it’s worth checking that the database table is appropriately indexed (e.g. one on token by itself) … not having a usable index is a good way of making an SQL query run very slowly.

    Explanation: Having an index on token may make either your existing queries or the “do all the work in the DB” query or both run faster, depending on the whims of the query optimiser in your DB software and the phase of the moon. If you don’t have a usable index, the DB will read ALL the rows in your table — not good.

    Creating an index: create index atable_token_idx on atable(token);

    Dropping an index: drop index atable_token_idx;

    (but do consult the docs for your DB)
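For SQLite specifically, the create and drop statements can be tried out from Python as sketched below. The `index_names` helper is hypothetical, and the table here is deliberately created without a primary key so that it starts with no index at all. The wording of the EXPLAIN QUERY PLAN output differs between SQLite versions, so it is only printed, not interpreted.

```python
import sqlite3


def index_names(conn, table):
    """Return the names of the indexes SQLite has recorded for a table."""
    rows = conn.execute(
        "select name from sqlite_master where type = 'index' and tbl_name = ?",
        (table,),
    ).fetchall()
    return sorted(name for (name,) in rows)


conn = sqlite3.connect(":memory:")
# No primary key on purpose: the table starts out without any index.
conn.execute("create table atable(docid text, token text, score float)")
before = index_names(conn, "atable")

conn.execute("create index atable_token_idx on atable(token)")
after_create = index_names(conn, "atable")

# Ask SQLite how it would run a lookup by token.
for plan_row in conn.execute(
    "explain query plan select score from atable where token = 'apple'"
):
    print(plan_row)

conn.execute("drop index atable_token_idx")
after_drop = index_names(conn, "atable")
print(before, after_create, after_drop)
```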

© 2021 The Archive Base. All Rights Reserved