This is my first post, have been a lurker for a long time, so


Asked: May 16, 2026 (2026-05-16T14:10:38+00:00)


This is my first post; I have been a lurker for a long time, so I will try my best to explain myself here.

I have been using the longest common substring method, along with a basic word match and a substring match (regexp), to cluster similar stories on the net.
The problem is that the time complexity is O(n^2), because I compare each title to all the others.
I’ve only done very basic optimizations, such as storing the titles that have already matched and skipping them.

What I want is some kind of preprocessing of the text so that, on each iteration, I reduce the number of posts I have to match against. Any further optimizations are also welcome.

Here are the functions I use. The main function first calls word_match; if more than 70% of the words match, it goes on to call substring_match and LCSubstr_len. The code is in Python, but I can use C as well.

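import re

def substring_match(a, b):
    # Each title is used as a regex pattern and matched against the start of the other.
    try:
        return bool(re.match(a, b) or re.match(b, a))
    except (re.error, TypeError):
        # A title that is not a valid pattern (or not a string) counts as no match.
        return False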

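def LCSubstr_len(S, T):
    m, n = len(S), len(T)
    if m + n == 0:
        return 0.0  # avoid dividing by zero when both strings are empty
    # L[i+1][j+1] is the length of the common substring ending at S[i] and T[j].
    L = [[0] * (n + 1) for _ in range(m + 1)]
    lcs = 0
    for i in range(m):
        for j in range(n):
            if S[i] == T[j]:
                L[i+1][j+1] = L[i][j] + 1
                lcs = max(lcs, L[i+1][j+1])
            # On a mismatch the cell stays 0, so the run is reset. Taking the
            # max of the neighbours here would compute a subsequence instead.
    # Normalize by the average length of the two strings.
    return lcs / (float(m + n) / 2)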

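def word_match(str1, str2):
    try:
        str1, str2 = str(str1), str(str2)
    except Exception:
        # Inputs that cannot be converted to text count as no match.
        return 0.0
    words1 = str1.split()
    words2 = str2.split()
    if not words1 and not words2:
        return 0.0  # avoid dividing by zero when both titles are empty
    matched = 0
    # Count every pair of equal words; split() already strips whitespace.
    for i in words1:
        for j in words2:
            if i == j:
                matched += 1
    # Normalize by the average word count (2.0 forces float division).
    return matched / ((len(words1) + len(words2)) / 2.0)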


1 Answer

Editorial Team · Answer 1 · 2026-05-16T14:10:39+00:00

Use an inverted index: for each word, store a list of pairs (docId, numOccurences).
Then, to find all strings that might be similar to a given string, go through its words and look up the strings containing each word in the inverted index. This gives you a table of (docId, wordMatchScore) pairs that automatically contains only the entries where wordMatchScore is non-zero.
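A minimal sketch of that idea, under some assumptions: the function names `build_inverted_index` and `candidate_scores` are made up for illustration, words are lowercased before indexing (the question's code does not do this), and the score is simply the number of shared word occurrences rather than the question's percentage.

```python
from collections import Counter, defaultdict

def build_inverted_index(titles):
    """Map each word to a list of (doc_id, num_occurrences) pairs."""
    index = defaultdict(list)
    for doc_id, title in enumerate(titles):
        # Lowercasing is an assumption; drop it to keep matching case-sensitive.
        for word, count in Counter(title.lower().split()).items():
            index[word].append((doc_id, count))
    return index

def candidate_scores(query, index):
    """Return {doc_id: score} for documents sharing at least one word with query."""
    scores = defaultdict(int)
    for word, q_count in Counter(query.lower().split()).items():
        # Only documents that actually contain this word are visited.
        for doc_id, d_count in index.get(word, ()):
            scores[doc_id] += min(q_count, d_count)
    return dict(scores)

# Hypothetical titles, for illustration only.
titles = [
    "Apple unveils new phone",
    "New phone unveiled by Apple",
    "Stock markets fall sharply",
]
index = build_inverted_index(titles)
scores = candidate_scores(titles[0], index)
```

Here `scores` contains entries only for the first two titles; the third shares no word with the query, so it never appears and never has to be compared. The expensive checks (substring_match, LCSubstr_len) then only need to run on the candidates whose score is high enough.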

There are a huge number of other possible optimizations, and your code as written is far from optimal, but if the goal is to reduce the number of string pairs you compare, the inverted index is the way to do it.
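As one example of a per-pair optimization (a sketch only, and an assumption about what would help most), the nested loops in word_match can be replaced by word counts. The name `word_match_fast` is hypothetical; it is meant to return the same value as counting every pair of equal words, but in time proportional to the number of words rather than their product.

```python
from collections import Counter

def word_match_fast(str1, str2):
    words1 = str1.split()
    words2 = str2.split()
    if not words1 and not words2:
        return 0.0  # nothing to compare
    counts1 = Counter(words1)
    counts2 = Counter(words2)
    # A word occurring c1 times in one title and c2 times in the other
    # contributes c1 * c2 equal pairs, exactly as the nested loops would.
    matched = sum(c * counts2[w] for w, c in counts1.items() if w in counts2)
    # Normalize by the average word count of the two titles.
    return matched / ((len(words1) + len(words2)) / 2.0)
```

Because it counts pairs the same way, a repeated word can still push the score above 1.0, just as in the original loops.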
