What algorithms could I use to determine common characters in a set of strings?
To keep the example simple, I only care about runs of 2+ characters that show up in 2 or more of the samples. For instance:
- 0000abcde0000
- 0000abcd00000
- 000abc0000000
- 00abc000de000
I’d like to know:
00 was used in 1,2,3,4
000 was used in 1,2,3,4
0000 was used in 1,2,3
00000 was used in 2,3
ab was used in 1,2,3,4
abc was used in 1,2,3,4
abcd was used in 1,2
bc was used in 1,2,3,4
bcd was used in 1,2
cd was used in 1,2
de was used in 1,4
I’m assuming that this is not homework. (If it is, you’re on your own re plagiarism! 😉)
Below is a quick-and-dirty solution. The time complexity is `O(m**2 * n)`, where `m` is the average string length and `n` is the size of the array of strings.
An instance of `Occurrence` keeps the set of indices which contain a given string. The `commonOccurrences` routine scans a string array, calling `captureOccurrences` for each non-null string. The `captureOccurrences` routine puts the current index into an `Occurrence` for each possible substring of the string it is given. Finally, `commonOccurrences` forms the result set by picking only those `Occurrence`s that have at least two indices.
Note that your example data has many more common substrings than you identified in the question. For example, `'00ab'` occurs in each of the input strings. An additional filter to select interesting strings based on content (e.g. all digits, all alphabetic, etc.) is, as they say, left as an exercise for the reader. 😉
QUICK AND DIRTY JAVA SOURCE:
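A minimal sketch of the approach described above, not the original listing. It keeps the `commonOccurrences` and `captureOccurrences` names from the description, but several details are assumptions: a map from substring to a sorted set of indices stands in for the `Occurrence` class, indices are 1-based to match the numbering in the question, and the class name `CommonSubstrings` and the `MIN_LENGTH` / `MIN_STRINGS` constants are made up for illustration.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class CommonSubstrings {
    // Thresholds taken from the question: 2+ characters, found in 2+ strings.
    static final int MIN_LENGTH = 2;
    static final int MIN_STRINGS = 2;

    // Records `index` under every substring of `s` with at least MIN_LENGTH characters.
    static void captureOccurrences(String s, int index,
                                   Map<String, SortedSet<Integer>> occurrences) {
        for (int start = 0; start + MIN_LENGTH <= s.length(); start++) {
            for (int end = start + MIN_LENGTH; end <= s.length(); end++) {
                occurrences
                    .computeIfAbsent(s.substring(start, end), k -> new TreeSet<>())
                    .add(index);
            }
        }
    }

    // Scans the array, then keeps only substrings seen in at least MIN_STRINGS strings.
    static Map<String, SortedSet<Integer>> commonOccurrences(String[] strings) {
        Map<String, SortedSet<Integer>> occurrences = new TreeMap<>();
        for (int i = 0; i < strings.length; i++) {
            if (strings[i] != null) {
                // Assumption: report 1-based positions, as the question does.
                captureOccurrences(strings[i], i + 1, occurrences);
            }
        }
        occurrences.values().removeIf(indices -> indices.size() < MIN_STRINGS);
        return occurrences;
    }

    public static void main(String[] args) {
        String[] samples = {
            "0000abcde0000", "0000abcd00000", "000abc0000000", "00abc000de000"
        };
        // Prints one substring per line, sorted alphabetically by the TreeMap.
        commonOccurrences(samples).forEach((substring, indices) ->
            System.out.println(substring + " was used in " + indices));
    }
}
```

Because a set of indices is kept per substring, a substring that repeats inside one string (such as `00`) still counts that string only once. The content filter mentioned above could be added as an extra condition before the `computeIfAbsent` call.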
SAMPLE OUTPUT: (note that there was actually only one Occurrence per line; I can’t seem to prevent the blockquote markup from merging lines)