Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6384667
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 25, 20262026-05-25T02:48:16+00:00 2026-05-25T02:48:16+00:00

I am implementing the tf-idf algorithm in a web application using Python, however it

  • 0

I am implementing the tf-idf algorithm in a web application using Python, however it runs extremely slow. What I basically do is:

1) Create 2 dictionaries:

  • First dictionary: key (document id), value (list of all found words (incl. repeated) in doc)
  • Second dictionary; key (document id), value (set containing unique words of the doc)

Now, there is a petition of a user to get tfidf results of document d. What I do is:

2) Loop over the unique words of the second dictionary for the document d, and for each unique word w get:

2.1) tf score (how many times w appears in d: loop over the the list of words of the first dictionary for the document)

2.2) df score (how many docs contain w: looping over the set of words of all documents (second dictionary) and check if w is contained). I am using a set since it seems to be faster for checking if a set contains a word compared to a list.

Step 2.2 is terribly slow. For example, having 1000 documents, and for a document with 2313 unique words, it takes around 5 minutes to output the results.

Is there any other way to make step 2.2 faster? Are dictionaries that slow for iterating?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-25T02:48:16+00:00Added an answer on May 25, 2026 at 2:48 am

    Well, you have to re-think and re-design somehow the way you hold your data, or in other words, implement an “orthodox” version of your “inverted index”.

    Your bottleneck is the “on-the-fly” calculation of the document frequency (DF) for the terms. It would be a clever idea for this to be dynamic, so every time you update your corpus (collection of documents), do some processing and update the DFs for every term in a document (and of course, save the results in a persistent way, aka a database etc..) .

    The only structure you need is a nested dictionary like that

    { "term1" : { "DF" : x, "some_doc_id" : tf , "some_other_doc_id" : tf, etc  } ,
      "term2" : ...
      etc..
    }
    

    properly updated every time you “feed” your corpus.

    And of course, keep somewhere your corpus cardinality…

    As a hobby and part of my work, I am implementing a python – redis backed small search engine. You might get some other ideas as well. Take a look here.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Whilst implementing my first MVVM application in WPF, I've been wondering about the pros
When implementing texture mapping on a standard polygonal mesh using the usual algorithm, you
Implementing a 'sandbox' environment in Python used to be done with the rexec module
While implementing a design using nested generic collections, I stumbled across those limitations apparently
When implementing the ViewModel in a Model-View-ViewModel architecture WPF application there seem to be
Implementing the ScriptControlClass was extremely easy, unfortunately the side effects with the language implementation
implementing publishActivity in PHP using the REST API using this code: $activity = array(
Implementing Document-View pattern I faced the problem: standard generic collections and dictionaries ( List<T>
While implementing XML file reading/writing in my application I saw that when I call
I have completed implementing Operation of Linear Linked List using C , Now inorder

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.