The Archive Base

Editorial Team
Asked: June 8, 2026 (2026-06-08T03:09:34+00:00)

This question has been asked here before:

What is a good strategy to group similar words?

but no clear answer was given on how to "group" items. The difflib-based solution is essentially a search: for a given item, difflib can return the most similar word out of a list. But how can this be used for grouping?

I would like to reduce

['ape', 'appel', 'apple', 'peach', 'puppy']

to

['ape', 'appel', 'peach', 'puppy']

or

['ape', 'apple', 'peach', 'puppy']

One idea I tried was to iterate through the list and, for each item, use the suggestion if get_close_matches returns more than one match, and otherwise keep the word as is. This partly worked, but it can suggest apple for appel and then appel for apple, so the two words simply switch places and nothing changes.
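The swap described above can be avoided by comparing each word only against representatives that have already been chosen, so a word is never replaced by something that is itself replaced later. A minimal sketch, assuming `difflib.get_close_matches` as the similarity test; the function name `dedupe_by_representative` and the `cutoff` value of 0.78 are illustrative choices, not part of the original question:

```python
from difflib import get_close_matches


def dedupe_by_representative(words, cutoff=0.78):
    """Greedy pass: each word either joins an earlier representative or becomes one.

    `cutoff` is an assumed similarity threshold (difflib ratio in [0, 1]).
    """
    representatives = []  # words kept in the reduced list, in input order
    assigned = {}         # word -> representative it was mapped to
    for word in words:
        # Only already-accepted representatives are candidates, so two similar
        # words can never point at each other.
        match = get_close_matches(word, representatives, n=1, cutoff=cutoff)
        if match:
            assigned[word] = match[0]
        else:
            representatives.append(word)
            assigned[word] = word
    return representatives, assigned


reps, assigned = dedupe_by_representative(['ape', 'appel', 'apple', 'peach', 'puppy'])
print(reps)
```

With this cutoff the sample list reduces to ['ape', 'appel', 'peach', 'puppy'], because 'apple' is absorbed by the earlier 'appel'. The result depends on the input order and on the cutoff chosen.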

I would appreciate any pointers, names of libraries, etc.

Note: performance also matters. We have a list of 300,000 items, and get_close_matches seems a bit slow. Does anyone know of a C/C++-based solution out there?

Thanks,

Note: Further investigation revealed that k-medoids is the right algorithm (as is hierarchical clustering). k-medoids does not require separate "centers"; it uses the data points themselves as centers (these points are called medoids, hence the name). In the word-grouping case, the medoid would be the representative element of its group / cluster.
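To make the medoid idea concrete, here is a rough sketch of an alternating k-medoids loop (assign each word to its nearest medoid, then re-pick each cluster's medoid). It is not a tuned implementation: the dissimilarity (1 minus the difflib ratio), the farthest-first initialisation, and the names `dissimilarity` and `k_medoids` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def dissimilarity(a, b):
    """Assumed metric: 1 - difflib ratio, which is 0.0 only for identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def k_medoids(words, k, max_iter=100):
    """Return {medoid: members}; every medoid is itself one of the input words."""
    words = list(dict.fromkeys(words))  # drop duplicates, keep input order
    if not 1 <= k <= len(words):
        raise ValueError("k must be between 1 and the number of distinct words")

    # Farthest-first initialisation: start with the first word, then keep
    # adding the word that is farthest from its nearest chosen medoid.
    medoids = [words[0]]
    while len(medoids) < k:
        candidates = [w for w in words if w not in medoids]
        medoids.append(max(
            candidates,
            key=lambda w: min(dissimilarity(w, m) for m in medoids)))

    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each word joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for w in words:
            nearest = min(medoids, key=lambda m: dissimilarity(w, m))
            clusters[nearest].append(w)
        # Update step: the new medoid is the member with the smallest total
        # dissimilarity to the other members of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dissimilarity(c, o) for o in members))
            for members in clusters.values()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters


clusters = k_medoids(['ape', 'appel', 'apple', 'peach', 'puppy'], k=4)
for medoid, members in clusters.items():
    print(medoid, members)
```

Note that k has to be chosen up front, and each iteration costs a number of comparisons that grows quickly with cluster size, so this plain version would need pruning or sampling before it could handle a list of 300,000 words.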

1 Answer

    Editorial Team
    Added an answer on June 8, 2026 at 3:09 am (2026-06-08T03:09:35+00:00)

    You need to normalize the groups. In each group, pick one word or code that represents the group. Then group the words by their representative.

    Some possible ways:

    • Pick the first encountered word.
    • Pick the lexicographically first word.
    • Derive a pattern for all the words.
    • Pick a unique index.
    • Use the Soundex code as the pattern.
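As an illustration of the last option, the sketch below groups words by a phonetic key. It assumes the common American Soundex rules (first letter plus three digits, with h and w not separating repeated codes); `soundex` and `group_by_key` are hypothetical helper names written for this example, not library functions.

```python
from collections import defaultdict

# Consonant classes of American Soundex; vowels, h, w and y have no code.
_CODES = {}
for _letters, _digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                         ('l', '4'), ('mn', '5'), ('r', '6')):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of `word` ('' if it has no letters)."""
    word = ''.join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ''
    result = word[0].upper()
    prev = _CODES.get(word[0], '')
    for ch in word[1:]:
        code = _CODES.get(ch, '')
        if code and code != prev:
            result += code
        if ch not in 'hw':  # h and w do not break a run of equal codes
            prev = code
    return (result + '000')[:4]


def group_by_key(words, key):
    """Group words that share the same key, preserving input order."""
    groups = defaultdict(list)
    for word in words:
        groups[key(word)].append(word)
    return dict(groups)


groups = group_by_key(['ape', 'appel', 'apple', 'peach', 'puppy'], soundex)
print(groups)
```

Here 'appel' and 'apple' both map to A140 and end up in one group, while the other three words keep their own codes. A key-based grouping like this needs only one pass over the list, but it can only merge words that the chosen key treats as identical.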

    Grouping the words could be difficult, though. If A is similar to B, and B is similar to C, A and C are not necessarily similar to each other. If B is the representative, both A and C could be included in the group. But if A or C is the representative, the other might not be included.
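A small example of that effect, assuming edit distance with a limit of 1 as the similarity test (the words and the helper names `edit_distance` and `members` are made up for illustration):

```python
def edit_distance(s1, s2):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                # delete c1
                            curr[j - 1] + 1,            # insert c2
                            prev[j - 1] + (c1 != c2)))  # substitute or match
        prev = curr
    return prev[-1]


def members(representative, words, limit=1):
    """Words that would join the group led by `representative`."""
    return [w for w in words if edit_distance(representative, w) <= limit]


words = ['cat', 'cot', 'dot']
around_cot = members('cot', words)  # 'cot' is within 1 edit of both others
around_cat = members('cat', words)  # 'dot' is 2 edits away from 'cat'
print(around_cot, around_cat)
```

With 'cot' as the representative all three words fall into one group, but with 'cat' as the representative 'dot' is left out, so the grouping depends on which word is picked first.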


    Going by the first alternative (first encountered word):

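    import itertools

    class Seeder:
        """Maps each word to a seed: the first-encountered word of its group."""

        def __init__(self):
            self.seeds = set()   # note: iteration order of a set is arbitrary
            self.cache = dict()  # word -> seed, so repeated lookups agree

        def get_seed(self, word):
            LIMIT = 2  # maximum edit distance for joining an existing seed
            seed = self.cache.get(word, None)
            if seed is not None:
                return seed
            # Join the first seed found within LIMIT; otherwise become a new seed.
            for seed in self.seeds:
                if self.distance(seed, word) <= LIMIT:
                    self.cache[word] = seed
                    return seed
            self.seeds.add(word)
            self.cache[word] = word
            return word

        def distance(self, s1, s2):
            """Levenshtein edit distance between s1 and s2."""
            l1 = len(s1)
            l2 = len(s2)
            # Rows must be lists (not range objects) because cells are assigned below.
            matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
            for zz in range(0, l2):
                for sz in range(0, l1):
                    if s1[sz] == s2[zz]:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                    else:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
            return matrix[l2][l1]

    def group_similar(words):
        seeder = Seeder()
        # Sort by seed first so that words sharing a seed are adjacent for groupby.
        words = sorted(words, key=seeder.get_seed)
        groups = itertools.groupby(words, key=seeder.get_seed)
        return [list(v) for k, v in groups]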
    

    Example:

    import pprint
    
    pprint.pprint(group_similar([
        'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
        'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
        'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
        'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
        'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
        'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
        'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
        'people', 'into', 'year', 'your', 'good', 'some', 'could',
        'them', 'see', 'other', 'than', 'then', 'now', 'look',
        'only', 'come', 'its', 'over', 'think', 'also', 'back',
        'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
        'way', 'even', 'new', 'want', 'because', 'any', 'these',
        'give', 'day', 'most', 'us'
    ]), width=120)
    

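    Output (one possible grouping; a word within LIMIT of several seeds joins whichever seed is checked first, and set iteration order is arbitrary):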

    [['after'],
     ['also'],
     ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
     ['back'],
     ['because'],
     ['but', 'about', 'get', 'just'],
     ['first'],
     ['from'],
     ['good', 'look'],
     ['have', 'make', 'give'],
     ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
     ['into'],
     ['know', 'new'],
     ['like', 'time', 'take'],
     ['most'],
     ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
     ['only'],
     ['over', 'our', 'even'],
     ['people'],
     ['say', 'she', 'way', 'day'],
     ['some', 'see', 'come'],
     ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
     ['think'],
     ['well'],
     ['what', 'who', 'when', 'than'],
     ['with', 'will', 'which'],
     ['work'],
     ['would', 'could'],
     ['year', 'your']]
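On the performance concern raised in the question: the seed search compares a new word with every existing seed, yet two words whose lengths differ by more than the limit can never be within that edit distance. The sketch below uses that fact to skip most comparisons. It is an assumed variation, not the code above: `LengthBucketedSeeder` and `edit_distance` are illustrative names, and seeds are checked in length order, so groupings can differ from the earlier output.

```python
from collections import defaultdict


def edit_distance(s1, s2):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


class LengthBucketedSeeder:
    """Like a first-encountered seeder, but only compares against seeds whose
    length is within `limit` of the word's length."""

    def __init__(self, limit=2):
        self.limit = limit
        self.seeds_by_length = defaultdict(list)  # length -> seeds of that length
        self.cache = {}                           # word -> seed

    def get_seed(self, word):
        if word in self.cache:
            return self.cache[word]
        n = len(word)
        # Edit distance is at least the difference in length, so seeds outside
        # [n - limit, n + limit] cannot match and are never examined.
        for length in range(max(0, n - self.limit), n + self.limit + 1):
            for seed in self.seeds_by_length.get(length, ()):
                if edit_distance(seed, word) <= self.limit:
                    self.cache[word] = seed
                    return seed
        self.seeds_by_length[n].append(word)
        self.cache[word] = word
        return word


seeder = LengthBucketedSeeder()
seeds = [seeder.get_seed(w) for w in ['apple', 'appel', 'pineapple', 'ape']]
print(seeds)
```

This only prunes candidates; each remaining comparison is still a pure-Python edit-distance computation, so a compiled distance function would be a separate and complementary speed-up.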
    