The Archive Base

Editorial Team
Asked: May 30, 2026 (2026-05-30T01:57:55+00:00)

Inspired by these two questions: String manipulation: calculate the "similarity of a string with its suffixes" and Program execution varies as the I/P size increases beyond 5 in C, I came up with the algorithm below.

My questions are:

  1. Is it correct, or have I made a mistake in my reasoning?
  2. What is the worst case complexity of the algorithm?

A bit of context first. For two strings, define their similarity as the length of the longest common prefix of the two. The total self-similarity of a string s is the sum of the similarities of s with all of its suffixes. So for example, the total self-similarity of abacab is 6 + 0 + 1 + 0 + 2 + 0 = 9 and the total self-similarity of a repeated n times is n*(n+1)/2.
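
As a quick sanity check of the definition, here is a minimal Python sketch of the direct (quadratic) computation. The function names `common_prefix_length` and `total_self_similarity` are made up for this illustration and do not come from the linked questions.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def total_self_similarity(s):
    """Sum of the similarities of s with each of its suffixes (quadratic time)."""
    return sum(common_prefix_length(s, s[i:]) for i in range(len(s)))


print(total_self_similarity("abacab"))  # 6 + 0 + 1 + 0 + 2 + 0 = 9
print(total_self_similarity("a" * 5))   # 5 * 6 / 2 = 15
```

This only restates the definition; it is the baseline that the border-based algorithm is meant to beat.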

Description of the algorithm: The algorithm is based on the Knuth-Morris-Pratt string searching algorithm, in that the borders of the string’s prefixes play the central role.

To recapitulate: a border of a string s is a proper substring b of s which is simultaneously a prefix and a suffix of s.

Remark: If b and c are borders of s with b shorter than c, then b is also a border of c, and conversely, every border of c is also a border of s.
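
To make the definition and the remark concrete, here is a small brute-force sketch in Python. The helper `borders` is a hypothetical name for this illustration; it lists every border, including the empty one, and is deliberately naive rather than efficient.

```python
def borders(s):
    """All borders of s, shortest first: proper prefixes of s that are also suffixes."""
    return [s[:k] for k in range(len(s)) if s.endswith(s[:k])]


s = "abacaba"
print(borders(s))  # ['', 'a', 'aba']

# The remark: the borders of a border c are exactly the borders of s shorter than c.
for c in borders(s):
    shorter = [b for b in borders(s) if len(b) < len(c)]
    assert borders(c) == shorter
```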

Let s be a string of length n and p be a prefix of s with length i. We call a border b with width k of p non-extensible if either i == n or s[i] != s[k], otherwise it’s extensible (the length k+1 prefix of s is then a border of the length i+1 prefix of s).

Now, if the longest common prefix of s and the suffix starting with s[i], i > 0, has length k, the length k prefix of s is a non-extensible border of the length i+k prefix of s. It is a border because it’s a common prefix of s and s[i .. n-1], and if it were extensible, it wouldn’t be the longest common prefix.

Conversely, every non-extensible border (of length k) of the length i prefix of s is the longest common prefix of s and the suffix starting with s[i-k].
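
The correspondence can be checked by brute force on small inputs. The sketch below is only an illustration under my own naming (`lcp`, `from_suffixes`, `from_borders` are hypothetical helpers): it collects pairs (start, k) once from the longest common prefixes and once from the non-empty non-extensible borders, and compares the two sets.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def from_suffixes(s):
    """Pairs (start, k): the suffix at start > 0 shares a prefix of length k >= 1 with s."""
    pairs = set()
    for start in range(1, len(s)):
        k = lcp(s, s[start:])
        if k > 0:
            pairs.add((start, k))
    return pairs


def from_borders(s):
    """Pairs (i - k, k) for each non-empty non-extensible border of width k of the i-prefix."""
    n = len(s)
    pairs = set()
    for i in range(1, n + 1):
        prefix = s[:i]
        for k in range(1, i):
            is_border = prefix.endswith(prefix[:k])
            extensible = i < n and s[i] == s[k]
            if is_border and not extensible:
                pairs.add((i - k, k))
    return pairs


for s in ["abacab", "aaaa", "aabaaab", "abcabcabc"]:
    assert from_suffixes(s) == from_borders(s)

print(sorted(from_borders("abacab")))  # [(2, 1), (4, 2)]
```

For abacab the two pairs contribute 1 + 2 = 3, and adding the length 6 gives the total of 9 from the example above.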

So we can calculate the total self-similarity of s by summing the lengths of all non-extensible borders of the length i prefixes of s, 1 <= i <= n. To do that:

  1. Calculate the width of the widest borders of the prefixes by the standard KMP preprocessing step.
  2. Calculate the width of the widest non-extensible borders of the prefixes.
  3. For each i, 1 <= i <= n, if p = s[0 .. i-1] has a non-empty non-extensible border, let b be the widest of these. Add the width of b, and for every non-empty border c of b that is also a non-extensible border of p, add the width of c.
  4. Add the length n of s, since that isn’t covered by the above.

Code (C):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/*
 * Overflow and NULL checks omitted to not clutter the algorithm.
 */

int similarity(char *text){
    int *borders, *ne_borders, len = strlen(text), i, j, sim;
    borders = malloc((len+1)*sizeof(*borders));
    ne_borders = malloc((len+1)*sizeof(*ne_borders));
    i = 0;
    j = -1;
    borders[i] = j;
    ne_borders[i] = j;
    /*
     * Find the length of the widest borders of prefixes of text,
     * standard KMP way, O(len).
     */
    while(i < len){
        while(j >= 0 && text[i] != text[j]){
            j = borders[j];
        }
        ++i, ++j;
        borders[i] = j;
    }
    /*
     * For each prefix, find the length of its widest non-extensible
     * border, this part is also O(len).
     */
    for(i = 1; i <= len; ++i){
        j = borders[i];
        /*
         * If the widest border of the i-prefix has width j and is
         * extensible (text[i] == text[j]), the widest non-extensible
         * border of the i-prefix is the widest non-extensible border
         * of the j-prefix.
         */
        if (text[i] == text[j]){
            j = ne_borders[j];
        }
        ne_borders[i] = j;
    }
    /* The longest common prefix of text and text is text. */
    sim = len;
    for(i = len; i > 0; --i){
        /*
         * If a longest common prefix of text and one of its suffixes
         * ends right before text[i], it is a non-extensible border of
         * the i-prefix of text, and conversely, every non-extensible
         * border of the i-prefix is a longest common prefix of text
         * and one of its suffixes.
         *
         * So, if the i-prefix has any non-extensible border, we must
         * sum the lengths of all these. Starting from the widest
         * non-extensible border, we must check all of its non-empty
         * borders for extensibility.
         *
         * Can this introduce nonlinearity? How many extensible borders
         * shorter than the widest non-extensible border can a prefix have?
         */
        if ((j = ne_borders[i]) > 0){
            sim += j;
            while(j > 0){
                j = borders[j];
                if (text[i] != text[j]){
                    sim += j;
                }
            }
        }
    }
    free(borders);
    free(ne_borders);
    return sim;
}


/* The naive algorithm for comparison */
int common_prefix(char *text, char *suffix){
    int c = 0;
    while(*suffix && *suffix++ == *text++) ++c;
    return c;
}

int naive_similarity(char *text){
    int len = (int)strlen(text);
    int i, sim = 0;
    for(i = 0; i < len; ++i){
        sim += common_prefix(text,text+i);
    }
    return sim;
}

int main(int argc, char *argv[]){
    int i;
    for(i = 1; i < argc; ++i){
        printf("%d\n",similarity(argv[i]));
    }
    for(i = 1; i < argc; ++i){
        printf("%d\n",naive_similarity(argv[i]));
    }
    return EXIT_SUCCESS;
}

So, is this correct? I’d be rather surprised if not, but I’ve been wrong before.

What is the worst case complexity of the algorithm?

I think it’s O(n), but I haven’t yet found a proof that the number of extensible borders a prefix can have contained in its widest non-extensible border is bounded (or rather, that the total number of such occurrences is O(n)).

I’m most interested in sharp bounds, but if you can prove that it’s e.g. O(n*log n) or O(n^(1+x)) for small x, that’s already good. (It’s obviously at worst quadratic, so an answer of “It’s O(n^2)” is only interesting if accompanied by an example for quadratic or near-quadratic behaviour.)

1 Answer

  1. Editorial Team
    Added an answer on May 30, 2026 at 1:57 am (2026-05-30T01:57:56+00:00)

    This looks like a really neat idea, but sadly I believe the worst case behaviour is O(n^2).

    Here is my attempt at a counterexample. (I’m not a mathematician so please forgive my use of Python instead of equations to express my ideas!)

    Consider the string with 4K+1 symbols

    s = 'a'*K+'X'+'a'*3*K
    

    This will have

    borders[1:] = list(range(K))*2+[K]*(2*K+1)
    
    ne_borders[1:] = [-1]*(K-1)+[K-1]+[-1]*K+[K]*(2*K+1)
    

    Note that:

    1) ne_borders[i] will equal K for (2K+1) values of i.

    2) for 0<=j<=K, borders[j]=j-1

    3) the final loop in your algorithm will go into the inner loop with j==K for 2K+1 values of i

    4) the inner loop will iterate K times to reduce j to 0

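    5) Summing these, the inner loop runs (K-1) + K*(2K+1) = 2K^2 + 2K - 1 times (the K-1 comes from i=K, where ne_borders[i] = K-1). With N = 4K+1 that is more than N*N/8 iterations for K >= 2, so the algorithm is quadratic on this family of strings of length N.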

    For example, for K=4 it goes round the inner loop 39 times:

    s = 'aaaaXaaaaaaaaaaaa'
    borders[1:] = [0, 1, 2, 3, 0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4]
    ne_borders[1:] = [-1, -1, -1, 3, -1, -1, -1, -1, 4, 4, 4, 4, 4, 4, 4, 4, 4]
    

    For K=2,248 it goes round the inner loop 10,111,503 times!
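
These counts can be reproduced with a direct Python port of the C function from the question. This is a sketch, not the original code: the name `count_inner_iterations`, the `iterations` counter, and the `"\0"` sentinel (standing in for C's terminating NUL, and assuming the input itself contains no NUL) are my additions.

```python
def count_inner_iterations(s):
    """Return (similarity, inner-loop iterations) for a port of the question's C code."""
    n = len(s)
    text = s + "\0"            # sentinel standing in for C's terminating NUL
    borders = [-1] * (n + 1)
    ne_borders = [-1] * (n + 1)
    i, j = 0, -1
    while i < n:               # standard KMP border table
        while j >= 0 and text[i] != text[j]:
            j = borders[j]
        i += 1
        j += 1
        borders[i] = j
    for i in range(1, n + 1):  # widest non-extensible border of each prefix
        j = borders[i]
        if text[i] == text[j]:
            j = ne_borders[j]
        ne_borders[i] = j
    sim, iterations = n, 0
    for i in range(n, 0, -1):
        j = ne_borders[i]
        if j > 0:
            sim += j
            while j > 0:       # the loop whose total cost is in question
                iterations += 1
                j = borders[j]
                if text[i] != text[j]:
                    sim += j
    return sim, iterations


K = 4
s = "a" * K + "X" + "a" * 3 * K
print(count_inner_iterations(s))  # (65, 39)
```

On this family of strings the counter comes out as 2*K*K + 2*K - 1, which is 39 for K=4 and 10,111,503 for K=2,248.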

    Perhaps there is a way to fix the algorithm for this case?
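
I don't know of a repair that keeps the border-based structure, but one linear-time route is the Z-function, which is a different technique: z[i] is exactly the length of the longest common prefix of s and the suffix starting at i, so the total self-similarity is the sum of the Z values. A minimal sketch, with `z_function` and `similarity_z` as names chosen for this illustration:

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:], computed in O(n)."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    l = r = 0                  # [l, r) is the rightmost matched segment found so far
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def similarity_z(s):
    """Total self-similarity of s as the sum of its Z values."""
    return sum(z_function(s))


print(similarity_z("abacab"))             # 9
print(similarity_z("aaaaXaaaaaaaaaaaa"))  # 65
```

The right end r of the matched segment never moves backwards, and every successful comparison in the while loop advances it, which is where the linear bound comes from.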
