I am using this simple algorithm to search for some text in a document and tag the pages on which I found it:
    for (int i = 1; i <= a.PageCount; i++)
    {
        Buf.Append(a.Pages[i].Text);
        String contain = Buf.ToString();
        if (contain != "")
        {
            // inside is a dictionary: each key maps to the list of pages where it was found
            foreach (KeyValuePair<string, List<string>> pair in inside)
            {
                if (contain.Contains(pair.Key))
                    inside[pair.Key].Add((i).ToString());
            }
        }
        Buf.Clear();
    }
It works, but when I search a 700-page document for over 500 keys it is very slow: it takes about 1-2 minutes to finish. Is there any way to speed it up? I am using C#.
Thanks!
A few points:
- You don't need `Buf`; just assign `a.Pages[i].Text` directly to `contain`.
- `inside[pair.Key]` wastes time looking up the value associated with that key; the time is wasted because you already have a much cheaper reference to that object in `pair.Value`.
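
As a minimal sketch of those two changes (not the asker's exact code): `pages` below is a hypothetical stand-in for `a.Pages`, assumed to map one-based page numbers to page text, and `PageTagger.TagPages` is an illustrative name rather than part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class PageTagger
{
    // 'pages' is an assumed stand-in for a.Pages: one-based page number -> page text.
    // 'inside' maps each search key to the list of page numbers where it was found.
    public static void TagPages(
        IReadOnlyDictionary<int, string> pages,
        Dictionary<string, List<string>> inside)
    {
        for (int i = 1; i <= pages.Count; i++)
        {
            // No StringBuilder round trip: use the page text directly.
            string contain = pages[i];
            if (string.IsNullOrEmpty(contain))
                continue;

            // Convert the page number once per page, not once per match.
            string pageNumber = i.ToString();

            foreach (KeyValuePair<string, List<string>> pair in inside)
            {
                // pair.Value is the list we already hold; no second dictionary lookup.
                if (contain.Contains(pair.Key))
                    pair.Value.Add(pageNumber);
            }
        }
    }
}
```

Appending to the lists while enumerating `inside` is safe here because only the list values change, not the dictionary itself. These changes remove some overhead, but the number of `Contains` calls stays the same.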
Finally, make sure `Pages` does in fact use a one-based index. Collections are more commonly zero-indexed.
EDIT since `Pages` is a dictionary:
How many times did you time the first code sample? The time could vary depending on many external factors; the fact that a single run of one approach is faster or slower than a single run of another doesn’t really tell you much, especially since the suggestions I made probably don’t address the bulk of the problem.
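
One way to get more trustworthy numbers is to time several runs after a warm-up and compare a summary such as the median instead of a single measurement. A rough sketch using `System.Diagnostics.Stopwatch`; `Timing.Measure` and `Timing.Median` are hypothetical helper names, not library APIs:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class Timing
{
    // Runs 'work' once untimed as a warm-up (JIT, caches), then 'runs' more
    // times, returning the elapsed milliseconds of each timed run.
    public static double[] Measure(Action work, int runs)
    {
        work();
        var results = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            work();
            sw.Stop();
            results[i] = sw.Elapsed.TotalMilliseconds;
        }
        return results;
    }

    // Median of the measurements; less sensitive to a single slow run than the mean.
    public static double Median(double[] values)
    {
        if (values.Length == 0)
            throw new ArgumentException("No measurements.", nameof(values));
        double[] sorted = values.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

For example, `Timing.Median(Timing.Measure(() => RunSearch(), 5))` would time five runs of a search method (`RunSearch` is a placeholder for whichever version is being measured).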
As someone else pointed out, the main problem is that you’re calling `contain.Contains(pair.Key)` 350,000 times (700 pages × 500 keys); that’s probably your bottleneck. You can profile the method to find out whether that is true. If it is, then something like the Rabin-Karp algorithm, as suggested by Miserable Variable, is probably your best bet.
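
To illustrate the idea, here is a hedged sketch of a multi-pattern Rabin-Karp search, not a tuned implementation. It assumes ordinal (case-sensitive) matching, groups the keys by length, and makes one rolling-hash pass over the page text per distinct key length. The class and method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class RabinKarpSearch
{
    private const ulong Base = 257; // hash multiplier; arithmetic wraps modulo 2^64

    // Returns the keys that occur somewhere in 'text' (ordinal comparison).
    public static HashSet<string> FindKeys(string text, IEnumerable<string> keys)
    {
        // Index keys by length, then by hash of the whole key.
        var byLength = new Dictionary<int, Dictionary<ulong, List<string>>>();
        foreach (string key in keys)
        {
            if (key.Length == 0 || key.Length > text.Length) continue;
            if (!byLength.TryGetValue(key.Length, out var byHash))
                byLength[key.Length] = byHash = new Dictionary<ulong, List<string>>();
            ulong h = Hash(key, key.Length);
            if (!byHash.TryGetValue(h, out var bucket))
                byHash[h] = bucket = new List<string>();
            bucket.Add(key);
        }

        var found = new HashSet<string>();
        foreach (var entry in byLength)
        {
            int m = entry.Key;
            ulong highPower = 1; // Base^(m-1): weight of the character leaving the window
            unchecked { for (int i = 1; i < m; i++) highPower *= Base; }

            ulong window = Hash(text, m);
            for (int start = 0; ; start++)
            {
                if (entry.Value.TryGetValue(window, out var candidates))
                {
                    foreach (string key in candidates)
                    {
                        // Hashes can collide, so confirm with a real comparison.
                        if (string.CompareOrdinal(text, start, key, 0, m) == 0)
                            found.Add(key);
                    }
                }
                if (start + m >= text.Length) break;
                // Roll the window: drop text[start], append text[start + m].
                unchecked
                {
                    window = (window - text[start] * highPower) * Base + text[start + m];
                }
            }
        }
        return found;
    }

    // Polynomial hash of the first 'length' characters of s.
    private static ulong Hash(string s, int length)
    {
        ulong h = 0;
        unchecked { for (int i = 0; i < length; i++) h = h * Base + s[i]; }
        return h;
    }
}
```

With this shape, each page costs roughly one pass per distinct key length instead of one `Contains` call per key, so the benefit depends on how many different lengths the 500 keys have. Profiling both versions on real data is the only way to know whether it helps.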