So as an exercise, I’m building an algorithm to make searching for words (arbitrary sets of characters) within larger strings as fast as possible. With almost no previous knowledge of existing search algorithms, my approach so far has been the following:
- Map out occurrences of pairs of characters within the larger string (Pair -> List of positions).
- For each pair, store also the number of occurrences found within the larger string.
- Get all character pairs within the search word.
- Using the pair from the search word that occurs least often in the string, check each of that pair’s positions to see whether the remaining characters of the search term match.
That’s the gist of it. I suppose I could use maps keyed on longer character sequences, but for now I’m just using pairs.
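To make the steps above concrete, here is a minimal Python sketch of the idea. The function names `build_pair_index` and `find_all` are made up for illustration, and the length of each position list stands in for the separately stored occurrence count.

```python
from collections import defaultdict


def build_pair_index(text):
    """Map each pair of adjacent characters to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - 1):
        index[text[i:i + 2]].append(i)
    return index


def find_all(text, index, word):
    """Return every start position of word in text, filtering by the rarest pair."""
    if len(word) < 2:
        # No pair to look up: fall back to a plain scan (empty word yields []).
        return [i for i, ch in enumerate(text) if ch == word]
    # Choose the offset in the word whose pair has the fewest positions in the text.
    # len() of the position list doubles as the occurrence count.
    best = min(range(len(word) - 1),
               key=lambda k: len(index.get(word[k:k + 2], ())))
    hits = []
    for pos in index.get(word[best:best + 2], ()):
        start = pos - best  # where the word would have to begin
        if start >= 0 and text.startswith(word, start):
            hits.append(start)
    return hits


text = "the cat sat on the mat"
index = build_pair_index(text)
print(find_all(text, index, "the"))
print(find_all(text, index, "mat"))
```

The index is built once per text, so this only pays off when the same text is searched many times.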
Is there much else I can do to make it faster? Am I approaching this the right way?
String-search is a heavily researched topic:
You are thinking about finding, e.g., two consecutive characters and storing the frequency of that combination. This is a very expensive operation, even if you use balanced data structures. I don’t really see how storing consecutive characters as a preprocessing step would help you.
So, there are obviously many, many algorithms for string search. What I find interesting is that some algorithms don’t even need to scan every character in the body of text. Example: if you search for the word ‘abbbbbc’ and the next character in the body of text is a ‘d’, you can immediately jump ahead 5 characters without even looking at them. If the character you land on is a ‘b’ or ‘c’, you obviously have to go back and check whether the jump skipped over a match; if it is not, you have skipped 5 characters with no comparison at all. This is difficult to implement, however, and leads to the theory of finite automata.
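The skipping idea is close to what the Boyer-Moore family of algorithms does. As a rough sketch, here is the simplified Boyer-Moore-Horspool variant in Python. It is not necessarily the exact scheme described above: this variant looks at the text character under the last position of the pattern and shifts by a precomputed amount, which is the full pattern length when that character does not occur in the pattern. The name `horspool_find` is made up for illustration, and the window is compared with a slice for brevity.

```python
def horspool_find(text, pattern):
    """Return the first index of pattern in text, or -1 if it is absent."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # For every pattern character except the last: how far the window may
    # move when that character sits under the window's last position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    start = 0
    while start + m <= n:
        if text[start:start + m] == pattern:
            return start
        # Characters missing from the table allow a full jump of m.
        start += shift.get(text[start + m - 1], m)
    return -1


print(horspool_find("abbbbbdabbbbbc", "abbbbbc"))
```

In this example the first window ends on a ‘d’, which never occurs in the pattern, so the search moves straight to index 7 without examining any window in between.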