The problem runs as follows: given two strings str1 and str2 and a third string str3, write a function that checks whether str3 contains the letters of both str1 and str2 in the same order as in the original strings, though the two may be interleaved. So, adbfec returns true for the strings adf and bec. I have written the following function in Python:
def isinter(str1,str2,str3):
    p1,p2,p3 = 0,0,0
    while p3 < len(str3):
        if p1 < len(str1) and str3[p3] == str1[p1]:
            p1 += 1
        elif p2 < len(str2) and str3[p3] == str2[p2]:
            p2 += 1
        else:
            break
        p3 = p1+p2
    return p3 == len(str3)
There is another version of this program at ardentart (the last solution). Which one is better? I think mine is, since it probably runs in linear time. Whether or not it is better, is there any further room for optimization in my algorithm?
You could split all three strings into lists and then walk list3 with the same algorithm you use now, checking whether list3[i] is equal to list1[0] or list2[0]. If it is, you’d del the item from the appropriate list. A premature list end could then be caught as an exception.
The algorithm would be exactly the same, but the implementation ought to be faster.
UPDATE: turns out it actually isn’t (about double the time). Oh well, might be useful to know.
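For reference, the list-based variant described above might look roughly like the sketch below. The name isinter_lists is hypothetical, and unlike the original function this version also requires both input lists to be fully consumed. Note that del on the first element of a Python list shifts every remaining element, which may be one reason this approach did not come out faster.

```python
def isinter_lists(str1, str2, str3):
    """List-based variant of the greedy check (hypothetical name)."""
    list1, list2 = list(str1), list(str2)
    try:
        for ch in str3:
            if list1 and ch == list1[0]:
                del list1[0]          # O(n): shifts the remaining items
            elif ch == list2[0]:      # raises IndexError if list2 is empty
                del list2[0]
            else:
                return False          # ch matches neither list head
    except IndexError:
        # Premature end of list2 while str3 still has unmatched letters.
        return False
    # Assumption: both inputs must be used up completely.
    return not list1 and not list2
```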
While benchmarking different scenarios, it turned out that unless the three string lengths are guaranteed to be “exact” (i.e., len(str1) + len(str2) == len(str3)), the most effective optimization is to check that condition first thing. This immediately discards all cases where the two input strings can’t match the third because of mismatched lengths.
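A minimal sketch of that early length check, wrapped around the same greedy walk as in the question. The name isinter_checked is an assumption made for illustration.

```python
def isinter_checked(str1, str2, str3):
    """Greedy check with an up-front length test (hypothetical name)."""
    # Cheap rejection: an interleaving uses every letter exactly once.
    if len(str1) + len(str2) != len(str3):
        return False
    p1 = p2 = 0
    for ch in str3:
        if p1 < len(str1) and ch == str1[p1]:
            p1 += 1
        elif p2 < len(str2) and ch == str2[p2]:
            p2 += 1
        else:
            return False
    # Lengths matched and every letter was consumed, so both inputs are used up.
    return True
```

With the check in place, a call such as isinter_checked("ab", "cd", "ac") is rejected before any character comparison happens.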
Then I encountered some cases where the same letter occurs in both strings, and assigning it to list1 rather than list2 (or vice versa) might lead to one of the strings no longer matching. In those cases the greedy algorithm fails with a false negative, and fixing that requires recursion (backtracking).
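As an illustration, with str1 = "ab", str2 = "ac" and str3 = "acab", the greedy walk takes the first a from str1 and then gets stuck on c, although taking it from str2 would succeed. The sketch below is one possible backtracking version with memoization; the name isinter_backtrack and the use of functools.lru_cache are assumptions, not necessarily what was benchmarked here.

```python
from functools import lru_cache

def isinter_backtrack(str1, str2, str3):
    """Backtracking check with memoization (hypothetical name)."""
    if len(str1) + len(str2) != len(str3):
        return False

    @lru_cache(maxsize=None)
    def match(p1, p2):
        # p1 and p2 count the letters already consumed from str1 and str2.
        p3 = p1 + p2
        if p3 == len(str3):
            return True
        ch = str3[p3]
        # Try str1 first; fall back to str2 if that branch fails.
        if p1 < len(str1) and str1[p1] == ch and match(p1 + 1, p2):
            return True
        return p2 < len(str2) and str2[p2] == ch and match(p1, p2 + 1)

    # Recursion depth grows with len(str3), so very long inputs
    # could hit Python's recursion limit.
    return match(0, 0)
```

The cache bounds the work to at most one evaluation per (p1, p2) pair, so the worst case is proportional to len(str1) * len(str2) rather than exponential.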
Then I ran some benchmarks on random strings. Note that the instrumentation always generates valid shuffles, which may yield biased results.
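The original instrumentation is not reproduced here. A benchmark of this kind could be sketched as follows, assuming a generator that always produces valid shuffles; random_shuffle_case and time_checker are hypothetical names, and the alphabet parameter allows restricting the letter range.

```python
import random
import string
import timeit

def random_shuffle_case(len1, len2, alphabet=string.ascii_lowercase, rng=random):
    """Return (str1, str2, str3) where str3 is a valid interleaving."""
    str1 = "".join(rng.choice(alphabet) for _ in range(len1))
    str2 = "".join(rng.choice(alphabet) for _ in range(len2))
    # Decide at random which positions of str3 are filled from str1.
    picks = [True] * len1 + [False] * len2
    rng.shuffle(picks)
    it1, it2 = iter(str1), iter(str2)
    str3 = "".join(next(it1) if from1 else next(it2) for from1 in picks)
    return str1, str2, str3

def time_checker(checker, cases, repeat=1000):
    """Average seconds per call of checker over the given cases."""
    total = timeit.timeit(lambda: [checker(*c) for c in cases], number=repeat)
    return total / (repeat * len(cases))
```

Any checker taking (str1, str2, str3) can be passed to time_checker, for example time_checker(isinter, cases) with cases built from repeated calls to random_shuffle_case.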
The results suggest that the cached+DP algorithm is more efficient for short strings. When strings get longer (more than 3-4 characters), the cached+DP algorithm starts losing ground. At around length 10, the algorithm above performs twice as fast as the totally-recursive, cached version.
The DP algorithm performs better, but still worse than the one above, if the strings contain repeated characters (I did this by restricting the range from a-z to a-i) and if the overlap is slight. For example, in one such case the DP loses by only 2 µs.
Not surprisingly, full overlap (one letter from each string in turn) shows the largest difference, with a ratio as high as 364:178 (a bit more than 2:1).