I wrote a little C# application that indexes a book and executes a boolean textretrieval algorithm on the index. The class at the end of the post showes the implementation of both, building the index and executing the algorithm on it.
The code is called via a GUI-Button in the following way:
private void Execute_Click(object sender, EventArgs e)
{
Stopwatch s;
String output = "-----------------------\r\n";
String sr = algoChoice.SelectedItem != null ? algoChoice.SelectedItem.ToString() : "";
switch(sr){
case "Naive search":
output += "Naive search\r\n";
algo = NaiveSearch.GetInstance();
break;
case "Boolean retrieval":
output += "boolean retrieval\r\n";
algo = BooleanRetrieval.GetInstance();
break;
default:
outputTextbox.Text = outputTextbox.Text + "Choose retrieval-algorithm!\r\n";
return;
}
output += algo.BuildIndex("../../DocumentCollection/PilzFuehrer.txt") + "\r\n";
postIndexMemory = GC.GetTotalMemory(true);
s = Stopwatch.StartNew();
output += algo.Start("../../DocumentCollection/PilzFuehrer.txt", new String[] { "Pilz", "blau", "giftig", "Pilze" });
s.Stop();
postQueryMemory = GC.GetTotalMemory(true);
output += "\r\nTime elapsed:" + s.ElapsedTicks/(double)Stopwatch.Frequency + "\r\n";
outputTextbox.Text = output + outputTextbox.Text;
}
The first execution of Start(…) runs about 700µs, every rerun only takes <10µs.
The application is compiled with Visual Studio 2010 and the default ‘Debug’ buildconfiguration.
I experimentad a lot to find the reason for that including profiling and different implementations but the effect always stays the same.
I’d be hyppy if anyone could give me some new ideas what I shall try or even an explanation.
class BooleanRetrieval:RetrievalAlgorithm
{
protected static RetrievalAlgorithm theInstance;
List<String> documentCollection;
Dictionary<String, BitArray> index;
private BooleanRetrieval()
: base("BooleanRetrieval")
{
}
public override String BuildIndex(string filepath)
{
documentCollection = new List<string>();
index = new Dictionary<string, BitArray>();
documentCollection.Add(filepath);
for(int i=0; i<documentCollection.Count; ++i)
{
StreamReader input = new StreamReader(documentCollection[i]);
var text = Regex.Split(input.ReadToEnd(), @"\W+").Distinct().ToArray();
foreach (String wordToIndex in text)
{
if (!index.ContainsKey(wordToIndex))
{
index.Add(wordToIndex, new BitArray(documentCollection.Count, false));
}
index[wordToIndex][i] = true;
}
}
return "Index " + index.Keys.Count + "words.";
}
public override String Start(String filepath, String[] search)
{
BitArray tempDecision = new BitArray(documentCollection.Count, true);
List<String> res = new List<string>();
foreach(String searchWord in search)
{
if (!index.ContainsKey(searchWord))
return "No documents found!";
tempDecision.And(index[searchWord]);
}
for (int i = 0; i < tempDecision.Count; ++i )
{
if (tempDecision[i] == true)
{
res.Add(documentCollection[i]);
}
}
return res.Count>0 ? res[0]: "Empty!";
}
public static RetrievalAlgorithm GetInstance()
{
Contract.Ensures(Contract.Result<RetrievalAlgorithm>() != null, "result is null.");
if (theInstance == null)
theInstance = new BooleanRetrieval();
theInstance.Executions++;
return theInstance;
}
}
Cold/warm start of .Net application is usually impacted by JIT time and disk access time to load assemblies.
For application that does a lot of disk IO very first access to data on disk will be much slower than on re-run for the same data due to caching content (also applies to assembly loading) if data is small enough to fit in memory cache for the disk.