Inspired by these two questions: String manipulation: calculate the "similarity of a string with

Question

0

Asked: May 30, 20262026-05-30T01:57:55+00:00 2026-05-30T01:57:55+00:00

Inspired by these two questions: String manipulation: calculate the "similarity of a string with

0

Inspired by these two questions: String manipulation: calculate the "similarity of a string with its suffixes" and Program execution varies as the I/P size increases beyond 5 in C, I came up with the below algorithm.

The questions will be

Is it correct, or have I made a mistake in my reasoning?
What is the worst case complexity of the algorithm?

A bit of context first. For two strings, define their similarity as the length of the longest common prefix of the two. The total self-similarity of a string s is the sum of the similarities of s with all of its suffixes. So for example, the total self-similarity of abacab is 6 + 0 + 1 + 0 + 2 + 0 = 9 and the total self-similarity of a repeated n times is n*(n+1)/2.

Description of the algorithm: The algorithm is based on the Knuth-Morris-Pratt string searching algorithm, in that the borders of the string’s prefixes play the central role.

To recapitulate: a border of a string s is a proper substring b of s which is simultaneously a prefix and a suffix of s.

Remark: If b and c are borders of s with b shorter than c, then b is also a border of c, and conversely, every border of c is also a border of s.

Let s be a string of length n and p be a prefix of s with length i. We call a border b with width k of p non-extensible if either i == n or s[i] != s[k], otherwise it’s extensible (the length k+1 prefix of s is then a border of the length i+1 prefix of s).

Now, if the longest common prefix of s and the suffix starting with s[i], i > 0, has length k, the length k prefix of s is a non-extensible border of the length i+k prefix of s. It is a border because it’s a common prefix of s and s[i .. n-1], and if it were extensible, it wouldn’t be the longest common prefix.

Conversely, every non-extensible border (of length k) of the length i prefix of s is the longest common prefix of s and the suffix starting with s[i-k].

So we can calculate the total self-similarity of s by summing the lengths of all non-extensible borders of the length i prefixes of s, 1 <= i <= n. To do that

Calculate the width of the widest borders of the prefixes by the standard KMP preprocessing step.
Calculate the width of the widest non-extensible borders of the prefixes.
For each i, 1 <= i <= n, if p = s[0 .. i-1] has a non-empty non-extensible border, let b be the widest of these, add the width of b and for all non-empty borders c of b, if it is a non-extensible border of p, add its length.
Add the length n of s, since that isn’t covered by the above.

Code (C):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/*
 * Overflow and NULL checks omitted to not clutter the algorithm.
 */

int similarity(char *text){
    int *borders, *ne_borders, len = strlen(text), i, j, sim;
    borders = malloc((len+1)*sizeof(*borders));
    ne_borders = malloc((len+1)*sizeof(*ne_borders));
    i = 0;
    j = -1;
    borders[i] = j;
    ne_borders[i] = j;
    /*
     * Find the length of the widest borders of prefixes of text,
     * standard KMP way, O(len).
     */
    while(i < len){
        while(j >= 0 && text[i] != text[j]){
            j = borders[j];
        }
        ++i, ++j;
        borders[i] = j;
    }
    /*
     * For each prefix, find the length of its widest non-extensible
     * border, this part is also O(len).
     */
    for(i = 1; i <= len; ++i){
        j = borders[i];
        /*
         * If the widest border of the i-prefix has width j and is
         * extensible (text[i] == text[j]), the widest non-extensible
         * border of the i-prefix is the widest non-extensible border
         * of the j-prefix.
         */
        if (text[i] == text[j]){
            j = ne_borders[j];
        }
        ne_borders[i] = j;
    }
    /* The longest common prefix of text and text is text. */
    sim = len;
    for(i = len; i > 0; --i){
        /*
         * If a longest common prefix of text and one of its suffixes
         * ends right before text[i], it is a non-extensible border of
         * the i-prefix of text, and conversely, every non-extensible
         * border of the i-prefix is a longest common prefix of text
         * and one of its suffixes.
         *
         * So, if the i-prefix has any non-extensible border, we must
         * sum the lengths of all these. Starting from the widest
         * non-extensible border, we must check all of its non-empty
         * borders for extendibility.
         *
         * Can this introduce nonlinearity? How many extensible borders
         * shorter than the widest non-extensible border can a prefix have?
         */
        if ((j = ne_borders[i]) > 0){
            sim += j;
            while(j > 0){
                j = borders[j];
                if (text[i] != text[j]){
                    sim += j;
                }
            }
        }
    }
    free(borders);
    free(ne_borders);
    return sim;
}


/* The naive algorithm for comparison */
int common_prefix(char *text, char *suffix){
    int c = 0;
    while(*suffix && *suffix++ == *text++) ++c;
    return c;
}

int naive_similarity(char *text){
    int len = (int)strlen(text);
    int i, sim = 0;
    for(i = 0; i < len; ++i){
        sim += common_prefix(text,text+i);
    }
    return sim;
}

int main(int argc, char *argv[]){
    int i;
    for(i = 1; i < argc; ++i){
        printf("%d\n",similarity(argv[i]));
    }
    for(i = 1; i < argc; ++i){
        printf("%d\n",naive_similarity(argv[i]));
    }
    return EXIT_SUCCESS;
}

So, is this correct? I’d be rather surprised if not, but I’ve been wrong before.

What is the worst case complexity of the algorithm?

I think it’s O(n), but I haven’t yet found a proof that the number of extensible borders a prefix can have contained in its widest non-extensible border is bounded (or rather, that the total number of such occurrences is O(n)).

I’m most interested in sharp bounds, but if you can prove that it’s e.g. O(n*log n) or O(n^(1+x)) for small x, that’s already good. (It’s obviously at worst quadratic, so an answer of “It’s O(n^2)” is only interesting if accompanied by an example for quadratic or near-quadratic behaviour.)

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-30T01:57:56+00:00

This looks like a really neat idea, but sadly I believe the worst case behaviour is O(n^2).

Here is my attempt at a counterexample. (I’m not a mathematician so please forgive my use of Python instead of equations to express my ideas!)

Consider the string with 4K+1 symbols

s = 'a'*K+'X'+'a'*3*K

This will have

borders[1:] = range(K)*2+[K]*(2*K+1)

ne_borders[1:] = [-1]*(K-1)+[K-1]+[-1]*K+[K]*(2*K+1)

Note that:

1) ne_borders[i] will equal K for (2K+1) values of i.

2) for 0<=j<=K, borders[j]=j-1

3) the final loop in your algorithm will go into the inner loop with j==K for 2K+1 values of i

4) the inner loop will iterate K times to reduce j to 0

5) This results in the algorithm needing more than N*N/8 operations to do a worst case string of length N.

For example, for K=4 it goes round the inner loop 39 times

s = 'aaaaXaaaaaaaaaaaa'
borders[1:] = [0, 1, 2, 3, 0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4]
ne_borders[1:] = [-1, -1, -1, 3, -1, -1, -1, -1, 4, 4, 4, 4, 4, 4, 4, 4, 4]

For K=2,248 it goes round the inner loop 10,111,503 times!

Perhaps there is a way to fix the algorithm for this case?

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Inspired by these two questions: String manipulation: calculate the "similarity of a string with

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply