Inspired by these two questions: String manipulation: calculate the "similarity of a string with its suffixes" and Program execution varies as the I/P size increases beyond 5 in C, I came up with the below algorithm.
The questions will be
- Is it correct, or have I made a mistake in my reasoning?
- What is the worst case complexity of the algorithm?
A bit of context first. For two strings, define their similarity as the length of the longest common prefix of the two. The total self-similarity of a string s is the sum of the similarities of s with all of its suffixes. So for example, the total self-similarity of abacab is 6 + 0 + 1 + 0 + 2 + 0 = 9 and the total self-similarity of a repeated n times is n*(n+1)/2.
Description of the algorithm: The algorithm is based on the Knuth-Morris-Pratt string searching algorithm, in that the borders of the string’s prefixes play the central role.
To recapitulate: a border of a string s is a proper substring b of s which is simultaneously a prefix and a suffix of s.
Remark: If b and c are borders of s with b shorter than c, then b is also a border of c, and conversely, every border of c is also a border of s.
Let s be a string of length n and p be a prefix of s with length i. We call a border b with width k of p non-extensible if either i == n or s[i] != s[k], otherwise it’s extensible (the length k+1 prefix of s is then a border of the length i+1 prefix of s).
Now, if the longest common prefix of s and the suffix starting with s[i], i > 0, has length k, the length k prefix of s is a non-extensible border of the length i+k prefix of s. It is a border because it’s a common prefix of s and s[i .. n-1], and if it were extensible, it wouldn’t be the longest common prefix.
Conversely, every non-extensible border (of length k) of the length i prefix of s is the longest common prefix of s and the suffix starting with s[i-k].
So we can calculate the total self-similarity of s by summing the lengths of all non-extensible borders of the length i prefixes of s, 1 <= i <= n. To do that
- Calculate the width of the widest borders of the prefixes by the standard KMP preprocessing step.
- Calculate the width of the widest non-extensible borders of the prefixes.
- For each i,
1 <= i <= n, ifp = s[0 .. i-1]has a non-empty non-extensible border, let b be the widest of these, add the width of b and for all non-empty borders c of b, if it is a non-extensible border of p, add its length. - Add the length n of s, since that isn’t covered by the above.
Code (C):
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
/*
* Overflow and NULL checks omitted to not clutter the algorithm.
*/
int similarity(char *text){
int *borders, *ne_borders, len = strlen(text), i, j, sim;
borders = malloc((len+1)*sizeof(*borders));
ne_borders = malloc((len+1)*sizeof(*ne_borders));
i = 0;
j = -1;
borders[i] = j;
ne_borders[i] = j;
/*
* Find the length of the widest borders of prefixes of text,
* standard KMP way, O(len).
*/
while(i < len){
while(j >= 0 && text[i] != text[j]){
j = borders[j];
}
++i, ++j;
borders[i] = j;
}
/*
* For each prefix, find the length of its widest non-extensible
* border, this part is also O(len).
*/
for(i = 1; i <= len; ++i){
j = borders[i];
/*
* If the widest border of the i-prefix has width j and is
* extensible (text[i] == text[j]), the widest non-extensible
* border of the i-prefix is the widest non-extensible border
* of the j-prefix.
*/
if (text[i] == text[j]){
j = ne_borders[j];
}
ne_borders[i] = j;
}
/* The longest common prefix of text and text is text. */
sim = len;
for(i = len; i > 0; --i){
/*
* If a longest common prefix of text and one of its suffixes
* ends right before text[i], it is a non-extensible border of
* the i-prefix of text, and conversely, every non-extensible
* border of the i-prefix is a longest common prefix of text
* and one of its suffixes.
*
* So, if the i-prefix has any non-extensible border, we must
* sum the lengths of all these. Starting from the widest
* non-extensible border, we must check all of its non-empty
* borders for extendibility.
*
* Can this introduce nonlinearity? How many extensible borders
* shorter than the widest non-extensible border can a prefix have?
*/
if ((j = ne_borders[i]) > 0){
sim += j;
while(j > 0){
j = borders[j];
if (text[i] != text[j]){
sim += j;
}
}
}
}
free(borders);
free(ne_borders);
return sim;
}
/* The naive algorithm for comparison */
int common_prefix(char *text, char *suffix){
int c = 0;
while(*suffix && *suffix++ == *text++) ++c;
return c;
}
int naive_similarity(char *text){
int len = (int)strlen(text);
int i, sim = 0;
for(i = 0; i < len; ++i){
sim += common_prefix(text,text+i);
}
return sim;
}
int main(int argc, char *argv[]){
int i;
for(i = 1; i < argc; ++i){
printf("%d\n",similarity(argv[i]));
}
for(i = 1; i < argc; ++i){
printf("%d\n",naive_similarity(argv[i]));
}
return EXIT_SUCCESS;
}
So, is this correct? I’d be rather surprised if not, but I’ve been wrong before.
What is the worst case complexity of the algorithm?
I think it’s O(n), but I haven’t yet found a proof that the number of extensible borders a prefix can have contained in its widest non-extensible border is bounded (or rather, that the total number of such occurrences is O(n)).
I’m most interested in sharp bounds, but if you can prove that it’s e.g. O(n*log n) or O(n^(1+x)) for small x, that’s already good. (It’s obviously at worst quadratic, so an answer of “It’s O(n^2)” is only interesting if accompanied by an example for quadratic or near-quadratic behaviour.)
This looks like a really neat idea, but sadly I believe the worst case behaviour is O(n^2).
Here is my attempt at a counterexample. (I’m not a mathematician so please forgive my use of Python instead of equations to express my ideas!)
Consider the string with 4K+1 symbols
This will have
Note that:
1) ne_borders[i] will equal K for (2K+1) values of i.
2) for 0<=j<=K, borders[j]=j-1
3) the final loop in your algorithm will go into the inner loop with j==K for 2K+1 values of i
4) the inner loop will iterate K times to reduce j to 0
5) This results in the algorithm needing more than N*N/8 operations to do a worst case string of length N.
For example, for K=4 it goes round the inner loop 39 times
For K=2,248 it goes round the inner loop 10,111,503 times!
Perhaps there is a way to fix the algorithm for this case?