I have a function which works a very slow for my task (it must be 10-100 times faster)
Here is code
public long Support(List<string[]> sequences, string[] words)
{
var count = 0;
foreach (var sequence in sequences)
{
for (int i = 0; i < sequence.Length - words.Length + 1; i++)
{
bool foundSeq = true;
for (int j = 0; j < words.Length; j++)
{
foundSeq = foundSeq && sequence[i + j] == words[j];
}
if (foundSeq)
{
count++;
break;
}
}
}
return count;
}
public void Support(List<string[]> sequences, List<SequenceInfo> sequenceInfoCollection)
{
System.Threading.Tasks.Parallel.ForEach(sequenceInfoCollection.Where(x => x.Support==null),sequenceInfo =>
{
sequenceInfo.Support = Support(sequences, sequenceInfo.Sequence);
});
}
Where List<string[]> sequences is a array of array of words. This array usually contains 250k+ rows. Each row is about 4-7 words. string[] words is a array of words(all words contains in sequences at least one time) which we trying to count.
The problem is foundSeq = foundSeq && sequence[i + j] == words[j];. This code take most of all execution time(Enumerable.MoveNext at second place). I want to hash all words in my array. Numbers compares faster then strings, right? I think it can help me to get 30%-80% of perfomance. But i need 10x! What can i to do? If you want to know it’s a part of apriory algorithm.
Support function check if the words sequence is a part any sequence in the sequences list and count how much times.
Knuth–Morris–Pratt algorithm
In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a “word” W within a main “text string” S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977.
From Wikipedia: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
This is one of the improvements that you should make. With a small difference: a “word” in your code is a “characters” in the terminology of the algorithm; your “words” array is what is a word in KMP.
The idea is that when you search for “abc def ghi jkl”, and have matched “abc def ghi” already, but the next word does not match, you can jump three positions.