The Archive Base

Editorial Team
Asked: May 14, 2026

I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.

The method has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with a simple implementation of tf-idf and cosine similarity.

Is there any program that can do this? Or should I start writing this from scratch?

1 Answer

  1. Editorial Team
    Added an answer on May 14, 2026 at 2:50 pm

    Check out the NLTK package (http://www.nltk.org); it has everything you need.

    For the cosine similarity:

    
    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to
        u.v / |u||v|.
        """
        # Despite the name, this returns a similarity: 1.0 for parallel vectors.
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
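
tf-idf vectors are usually sparse, so it can be handy to apply the same formula to dictionaries instead of dense arrays. The following is only a sketch in plain Python 3; the function name `cosine_similarity_sparse` and the choice to return 0.0 for an empty vector are assumptions made for illustration, not part of NLTK.

```python
import math

def cosine_similarity_sparse(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

a = {"cat": 1.0, "sat": 2.0}
b = {"cat": 1.0, "mat": 2.0}
print(cosine_similarity_sparse(a, b))  # roughly 0.2
```

With non-negative weights such as term counts or tf-idf values, the result stays between 0 and 1, which matches the score range asked for in the question.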
    

    For ngrams:

    
    from itertools import chain

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """
        A utility that produces a sequence of ngrams from a sequence of items.
        For example:
    
        >>> ngrams([1,2,3,4,5], 3)
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    
        Use ingram for an iterator version of this function.  Set pad_left
        or pad_right to true in order to get additional ngrams:
    
        >>> ngrams([1,2,3,4,5], 2, pad_right=True)
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
    
        @param sequence: the source data to be converted into ngrams
        @type sequence: C{sequence} or C{iterator}
        @param n: the degree of the ngrams
        @type n: C{int}
        @param pad_left: whether the ngrams should be left-padded
        @type pad_left: C{boolean}
        @param pad_right: whether the ngrams should be right-padded
        @type pad_right: C{boolean}
        @param pad_symbol: the symbol to use for padding (default is None)
        @type pad_symbol: C{any}
        @return: The ngrams
        @rtype: C{list} of C{tuple}s
        """
    
        if pad_left:
            sequence = chain((pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n-1))
        sequence = list(sequence)
    
        count = max(0, len(sequence) - n + 1)
        return [tuple(sequence[i:i+n]) for i in range(count)] 
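
To apply this to document text, the text first has to be split into items. Here is a compact sketch of that step, assuming simple lowercasing and whitespace tokenization; the helper `text_ngrams` and its `by_word` flag are hypothetical names, and a real tokenizer (NLTK ships several) would handle punctuation better.

```python
def text_ngrams(text, n, by_word=True):
    """Return the n-grams of a text as a list of tuples.

    by_word=True splits on whitespace; by_word=False uses single characters.
    """
    items = text.lower().split() if by_word else list(text.lower())
    # Zipping n shifted views of the sequence yields every window of length n.
    return list(zip(*(items[i:] for i in range(n))))

print(text_ngrams("The quick brown fox", 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(text_ngrams("abcd", 3, by_word=False))
# [('a', 'b', 'c'), ('b', 'c', 'd')]
```

The `n` argument is what lets you choose how many grams to use; a text shorter than `n` items simply yields an empty list.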
    

    For tf-idf you will have to compute the term distribution first. I am using Lucene to do that, but you can do something similar with NLTK using FreqDist:

    http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
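
As a rough illustration of the frequency-distribution approach, the sketch below uses `collections.Counter` from the standard library (in NLTK 3, `FreqDist` is a subclass of `Counter`, so it can be swapped in). It assumes Python 3 and one common weighting, raw count times log(N / df); the function name `tf_idf_vectors` is made up for this example, and this is not necessarily the formula Lucene applies.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dict."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts, as a FreqDist would give
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors

docs = [["a", "b", "b"], ["a", "c"]]
for vec in tf_idf_vectors(docs):
    print(vec)
```

Note that with this variant a term occurring in every document (here `"a"`) gets weight 0, so it does not contribute to the similarity at all.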

    If you prefer PyLucene, this shows how to compute tf.idf:

        # reader = lucene.IndexReader(FSDirectory.open(index_loc))
        # Assumes reader, searcher, sim, tmap, fieldname, max_doc,
        # maxtokensperdoc and cosine_normalization are defined elsewhere.
        docs = reader.numDocs()
        for i in range(docs):
            tfv = reader.getTermFreqVector(i, fieldname)
            if tfv:
                rec = {}
                terms = tfv.getTerms()
                frequencies = tfv.getTermFrequencies()
                # zipping with range() caps the number of terms at maxtokensperdoc
                for (t, f, x) in zip(terms, frequencies, range(maxtokensperdoc)):
                    df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                    tmap.setdefault(t, len(tmap))
                    rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
                # and normalize the values using cosine normalization
                if cosine_normalization:
                    denom = sum(w ** 2 for w in rec.values()) ** 0.5
                    for k, v in rec.items():
                        rec[k] = v / denom
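
If you would rather avoid Lucene, the three pieces can be put together in a few lines of plain Python 3. This is a minimal sketch under stated assumptions: whitespace tokenization, a smoothed idf of log((1 + N) / (1 + df)) + 1, and hypothetical helper names (`ngram_counts`, `normalized_tfidf`, `similarity`). Because each vector is cosine-normalized to unit length, as in the snippet above, the cosine similarity reduces to a dot product.

```python
import math
from collections import Counter

def ngram_counts(text, n):
    """Count the word n-grams of a text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def normalized_tfidf(texts, n=1):
    """Return one unit-length {ngram: weight} dict per text."""
    counts = [ngram_counts(t, n) for t in texts]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    # Smoothed idf (an assumption) keeps weights positive even for
    # n-grams that occur in every document.
    idf = {g: math.log((1 + len(texts)) / (1 + d)) + 1 for g, d in df.items()}
    vectors = []
    for c in counts:
        vec = {g: f * idf[g] for g, f in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors

def similarity(u, v):
    # For unit-length vectors the cosine is just the dot product.
    return sum(w * v[g] for g, w in u.items() if g in v)

texts = ["the cat sat on the mat",
         "the cat sat on the rug",
         "stock prices fell sharply"]
vecs = normalized_tfidf(texts, n=2)
print(similarity(vecs[0], vecs[1]))  # high: most bigrams are shared
print(similarity(vecs[0], vecs[2]))  # 0: no bigram in common
```

Changing `n` switches between unigrams, bigrams and so on, and the score always falls between 0 and 1 because all weights are non-negative.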
    