I have the following code:

    private void LoadIntoMemory()
    {
        //Init large HashSet
        HashSet<document> hsAllDocuments = new HashSet<document>();
        //Get first rows from database
        List<document> docsList = document.GetAllAboveDocID(0, 500000);
        //Load objects into dictionary
        foreach (document d in docsList)
        {
            hsAllDocuments.Add(d);
        }
        Application["dicAllDocuments"] = hsAllDocuments;
    }

    private HashSet<document> documentHits(HashSet<document> hsRawHit, HashSet<document> hsAllDocuments, string query, string[] queryArray)
    {
        int counter = 0;
        const int maxCount = 1000;
        foreach (document d in hsAllDocuments)
        {
            //Headline
            if (d.Headline.Contains(query))
            {
                if (counter >= maxCount)
                    break;
                hsRawHit.Add(d);
                counter++;
            }
            //Description
            if (d.Description.Contains(query))
            {
                if (counter >= maxCount)
                    break;
                hsRawHit.Add(d);
                counter++;
            }
            //splitted query word by word
            //string[] queryArray = query.Split(' ');
            if (queryArray.Count() > 1)
            {
                foreach (string q in queryArray)
                {
                    if (d.Headline.Contains(q))
                    {
                        if (counter >= maxCount)
                            break;
                        hsRawHit.Add(d);
                        counter++;
                    }
                    //Description
                    if (d.Description.Contains(q))
                    {
                        if (counter >= maxCount)
                            break;
                        hsRawHit.Add(d);
                        counter++;
                    }
                }
            }
        }
        return hsRawHit;
    }

First I load all the data into a HashSet (stored in Application for later use). That part runs fine, and it is totally OK for it to be slow for what I'm doing.
I will be running C# on the 4.0 framework (I can't move to the newer update for 4.0 with the async stuff).
The documentHits method runs fairly slowly on my current setup, considering that it's all in memory. What can I do to speed up this method?
Examples would be awesome. Thanks.
I see that you are using a HashSet, but you are not using any of its advantages, so you should just use a List instead.
What's taking time is looping through all the documents and looking for strings in other strings, so you should try to eliminate as much of that as possible.
One possibility is to set up indexes of which documents contain which character pairs. If the string query contains "Hello", you would be looking in the documents that contain "He", "el", "ll" and "lo".
You could set up a Dictionary<string, List<int>> where the dictionary key is the character combination and the list contains indexes to the documents in your document list. Setting up the dictionary will take some time, of course, but you can focus on the character combinations that are less common. If a character combination exists in 80% of the documents, it's pretty useless for eliminating documents, but if a character combination exists in only 2% of the documents, it has eliminated 98% of your work.
If you loop through the documents in the list and add occurrences to the lists in the dictionary, the lists of indexes will be sorted, so it will be very easy to join the lists later on. When you add indexes to a list, you can throw the list away once it gets too large and you see that it would not be useful for eliminating documents. That way you will only be keeping the shorter lists, and they will not consume so much memory.
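
The character-pair index described above could look roughly like the following. This is only a minimal sketch, not the code from the original answer: the Document class, the PairIndex class and the maxListSize parameter are hypothetical names introduced for illustration, Headline and Description are assumed to be non-null, and matching is case-sensitive just like string.Contains. It sticks to language features available on .NET 4.0.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's document class.
public class Document
{
    public string Headline { get; set; }
    public string Description { get; set; }
}

// Hypothetical index: character pair -> ascending indexes into the document list.
// A null list marks a pair that was discarded because it is too common.
public class PairIndex
{
    private readonly List<Document> docs;
    private readonly Dictionary<string, List<int>> index =
        new Dictionary<string, List<int>>();

    public PairIndex(List<Document> documents, int maxListSize)
    {
        docs = documents;
        for (int i = 0; i < docs.Count; i++)
        {
            // Distinct() ensures each document index is added at most once per pair.
            var pairs = Pairs(docs[i].Headline).Concat(Pairs(docs[i].Description)).Distinct();
            foreach (string pair in pairs)
            {
                List<int> list;
                if (!index.TryGetValue(pair, out list))
                {
                    list = new List<int>();
                    index[pair] = list;
                }
                else if (list == null)
                {
                    continue; // already judged too common
                }
                list.Add(i); // i only grows, so every list stays sorted
                if (list.Count > maxListSize)
                    index[pair] = null; // throw away lists that get too large
            }
        }
    }

    public List<Document> Search(string query, int maxCount)
    {
        var hits = new List<Document>();
        List<int> candidates = null;
        foreach (string pair in Pairs(query).Distinct())
        {
            List<int> list;
            if (!index.TryGetValue(pair, out list))
                return hits; // pair occurs in no document, so the query cannot match
            if (list == null)
                continue; // too common to narrow anything down
            candidates = candidates == null ? list : Intersect(candidates, list);
        }
        // No usable pair (short query or only common pairs): fall back to a full scan.
        IEnumerable<int> toCheck = candidates ?? Enumerable.Range(0, docs.Count);
        foreach (int i in toCheck)
        {
            // The index only narrows the candidates; Contains still confirms the hit.
            Document d = docs[i];
            if (d.Headline.Contains(query) || d.Description.Contains(query))
            {
                hits.Add(d);
                if (hits.Count >= maxCount)
                    break;
            }
        }
        return hits;
    }

    // All overlapping two-character substrings of the text.
    private static IEnumerable<string> Pairs(string text)
    {
        for (int i = 0; i + 1 < text.Length; i++)
            yield return text.Substring(i, 2);
    }

    // Joins two ascending lists in a single pass.
    private static List<int> Intersect(List<int> a, List<int> b)
    {
        var result = new List<int>();
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count)
        {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { result.Add(a[i]); i++; j++; }
        }
        return result;
    }
}
```

In this sketch you would build the index once, next to where the documents are loaded, for example new PairIndex(docsList, docsList.Count / 5), and then call Search(query, 1000) per request. The cutoff of one fifth of the documents is an arbitrary assumption and would need tuning against real data.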
Edit:
I put together a small example:
Test:
Output: