We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,
"a" + "b" + "c" => "abc"
"abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
"abcdede" + "dedefgh" + "" => "abcdedefgh"
"abcde" + "d" + "ghlmn" => "abcdedghlmn"
"abcdef" + "" + "defghl" => "abcdefghl"
Our current algorithm is pretty slow because it uses brute force to identify the overlapping part between 2 strings. Does anyone know an efficient algorithm to do this?
Say we have 2 strings A and B. The algorithm needs to find the longest string S such that A ends with S and B starts with S.
Our current brute-force implementation in Java is attached for reference:
    public static String concat(String s1, String s2) {
        if (s1 == null)
            return s2;
        if (s2 == null)
            return s1;
        int len = Math.min(s1.length(), s2.length());
        // Find the index for the end of overlapping part,
        // trying the longest candidate prefix of s2 first
        int index = -1;
        for (int i = len; i > 0; i--) {
            String substring = s2.substring(0, i);
            if (s1.endsWith(substring)) {
                index = i;
                break;
            }
        }
        StringBuilder sb = new StringBuilder(s1);
        if (index < 0)
            sb.append(s2);                  // no overlap: append all of s2
        else
            sb.append(s2.substring(index)); // skip the overlapping prefix
        return sb.toString();
    }
Most of the other answers have focused on constant-factor optimizations, but it’s also possible to do asymptotically better. Look at your algorithm: it’s O(N^2). This seems like a problem that can be solved much faster than that!
Consider the Knuth-Morris-Pratt (KMP) string-search algorithm. While scanning a text for a pattern, it keeps track of how much of the pattern it has matched so far. Run it with S2 as the pattern and S1 as the text: when it reaches the end of S1, it knows how long a prefix of S2 matches the end of S1, and that’s the value we’re looking for! Just modify the algorithm to continue instead of returning when it matches the substring early on, and have it return the amount matched instead of 0 at the end.
That gives you an O(n) algorithm. Nice!
For example, OverlappedStringLength("abcdef", "defghl") returns 3.
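To make the idea concrete, here is a minimal Java sketch of a KMP-style overlap computation. It is an illustration under my own assumptions, not the code from any particular answer: the class name `OverlapConcat` and the method name `overlapLength` are hypothetical, and `concat` simply mirrors the null handling of the brute-force version in the question. The sketch first trims S1 to at most the length of S2, so a full match of the pattern can only happen at the very end of the scan and no early-return case is needed.

```java
final class OverlapConcat {

    /** Length of the longest suffix of s1 that is also a prefix of s2. */
    static int overlapLength(String s1, String s2) {
        // Only the last |s2| characters of s1 and the first |s1| characters
        // of s2 can take part in an overlap.
        int n = Math.min(s1.length(), s2.length());
        if (n == 0) {
            return 0;
        }
        String text = s1.substring(s1.length() - n);
        String pattern = s2.substring(0, n);

        // KMP failure table: fail[i] is the length of the longest proper
        // prefix of pattern[0..i] that is also a suffix of pattern[0..i].
        int[] fail = new int[n];
        for (int i = 1, k = 0; i < n; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }

        // Scan the text once. After each character, matched is the length of
        // the longest pattern prefix that ends at that position of the text.
        int matched = 0;
        for (int i = 0; i < n; i++) {
            while (matched > 0 && text.charAt(i) != pattern.charAt(matched)) {
                matched = fail[matched - 1];
            }
            if (text.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
        }
        // At the end of the text, matched is the overlap length.
        return matched;
    }

    /** Concatenates s1 and s2 without duplicating the overlapping part. */
    static String concat(String s1, String s2) {
        if (s1 == null) {
            return s2;
        }
        if (s2 == null) {
            return s1;
        }
        return s1 + s2.substring(overlapLength(s1, s2));
    }
}
```

Three columns can then be combined with `OverlapConcat.concat(OverlapConcat.concat(a, b), c)`. Building the table and scanning the text each take time linear in the shorter string's length, which is where the O(n) bound comes from.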