I’m trying to build a Lucene serializer class that serializes/deserializes objects (classes) whose properties are decorated with DataMember and a special attribute carrying instructions on how to store the property/field in a Lucene index.
The class works fine when I need to retrieve a single object by a certain key/value pair.
But I noticed that when I need to retrieve all items, and there are, let’s say, 100,000 documents, MySQL does it about 10 times faster, for some reason.
Could the Lucene experts please review this code and suggest any ideas for improving its performance?
    public IEnumerable<T> LoadAll()
    {
        IndexReader reader = IndexReader.Open(this.PathToLuceneIndex);
        int itemsCount = reader.NumDocs();
        for (int i = 0; i < itemsCount; i++)
        {
            if (!reader.IsDeleted(i))
            {
                Document doc = reader.Document(i);
                if (doc != null)
                {
                    T item = Deserialize(doc);
                    yield return item;
                }
            }
        }
        if (reader != null) reader.Close();
    }

    private T Deserialize(Document doc)
    {
        T itemInstance = Activator.CreateInstance<T>();
        foreach (string fieldName in fieldTypes.Keys)
        {
            Field myField = doc.GetField(fieldName);
            //Not every document may have the full collection of indexable fields
            if (myField != null)
            {
                object fieldValue = myField.StringValue();
                Type fieldType = fieldTypes[fieldName];
                if (fieldType == typeof(bool))
                    fieldValue = fieldValue == "1" ? true : false;
                if (fieldType == typeof(DateTime))
                    fieldValue = DateTools.StringToDate((string)fieldValue);
                pF.SetValue(itemInstance, fieldName, fieldValue);
            }
        }
        return itemInstance;
    }
Thank you in advance!
Here are some tips:
First, don’t use IndexReader.Open(string path). Not only will it be removed in the next major release of Lucene.Net, it’s generally not your best option: a ton of unnecessary code is called when you let Lucene generate the directory for you. I suggest opening the directory yourself and passing it to IndexReader.Open.

You should also open the IndexReader as read-only if you don’t absolutely need to write to it, as it will be quicker, especially in multi-threaded environments.

If you know the size of your index is not more than you can hold in memory (i.e. less than 500-600 MB and not compressed), you can use a RAMDirectory instead. This loads the entire index into memory, letting you bypass most of the costly IO operations you would incur by leaving the index on disk. It should greatly improve your speed, especially if you combine it with the other suggestions below.

If the index is too large to fit in memory, you either need to split the index up into chunks (i.e. an index every n MB) or just continue to read it from disk.
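A minimal sketch of the two directory suggestions above, assuming the Lucene.Net 2.9.x API (FSDirectory.Open, the IndexReader.Open overload that takes a readOnly flag, and the RAMDirectory constructor that copies an existing directory). The IndexReaderFactory class and the loadIntoMemory parameter are hypothetical names used for illustration; they are not part of Lucene.Net or of the code in the question:

```csharp
using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Store;
// Alias avoids the name clash between System.IO.Directory and Lucene's Directory.
using LuceneDirectory = Lucene.Net.Store.Directory;

// Hypothetical helper, not part of Lucene.Net.
public static class IndexReaderFactory
{
    public static IndexReader OpenReadOnly(string pathToLuceneIndex, bool loadIntoMemory)
    {
        // Open the directory explicitly instead of passing a path string
        // to IndexReader.Open.
        LuceneDirectory directory = FSDirectory.Open(new DirectoryInfo(pathToLuceneIndex));

        if (loadIntoMemory)
        {
            // Copy the on-disk index into RAM. Only sensible when the whole
            // index fits comfortably in memory.
            LuceneDirectory onDisk = directory;
            directory = new RAMDirectory(onDisk);
            onDisk.Close(); // the copy no longer needs the on-disk directory
        }

        // The second argument requests a read-only reader.
        return IndexReader.Open(directory, true);
    }
}
```

LoadAll() could then call IndexReaderFactory.OpenReadOnly(this.PathToLuceneIndex, true) in place of IndexReader.Open(this.PathToLuceneIndex). Copying into a RAMDirectory reads the whole index once up front, so it is likely to pay off mainly when the reader is kept and reused rather than reopened on every call.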
Also, I know you can’t yield return in a try...catch, but you can in a try...finally, and I would recommend wrapping your logic in LoadAll() in a try...finally, so that the reader is closed even if deserialization throws or the caller stops enumerating early.

Now, when it comes to your actual Deserialize code, you’re probably doing it in nearly the fastest way possible, except that you are storing the string in an object when you don’t need to. Lucene only stores the field as a byte[] array or a string. Since you’re calling StringValue(), you know it will always be a string, so keep it in a string variable and only box a converted value (the bool or the DateTime) when absolutely necessary.
That will at least sometimes save you a minor boxing cost (really, not much).
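The try...finally and string-variable suggestions above could look roughly like the sketch below. It assumes the Lucene.Net 2.9.x API. The LuceneSerializer<T> class name, the way fieldTypes is declared, and the reflection-based SetValue method (standing in for the question’s pF.SetValue) are assumptions made so the example is complete; the loop also runs to MaxDoc() rather than NumDocs(), which is an addition here, since document numbers range up to MaxDoc() and NumDocs() excludes deleted documents:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Reflection;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical class shell; only LoadAll and Deserialize mirror the question.
public class LuceneSerializer<T> where T : class
{
    // Assumed to be filled elsewhere from the attributes on T.
    private readonly Dictionary<string, Type> fieldTypes = new Dictionary<string, Type>();

    public string PathToLuceneIndex { get; set; }

    public IEnumerable<T> LoadAll()
    {
        IndexReader reader = IndexReader.Open(
            FSDirectory.Open(new DirectoryInfo(PathToLuceneIndex)), true);
        try
        {
            // Document numbers run from 0 to MaxDoc() - 1, deleted ones included.
            int maxDoc = reader.MaxDoc();
            for (int i = 0; i < maxDoc; i++)
            {
                if (reader.IsDeleted(i)) continue;
                yield return Deserialize(reader.Document(i));
            }
        }
        finally
        {
            // Runs on normal completion, on an exception, and when the
            // caller disposes the enumerator early (e.g. breaks out of foreach).
            reader.Close();
        }
    }

    private T Deserialize(Document doc)
    {
        T itemInstance = Activator.CreateInstance<T>();
        foreach (KeyValuePair<string, Type> entry in fieldTypes)
        {
            Field field = doc.GetField(entry.Key);
            // Not every document may have the full collection of indexable fields.
            if (field == null) continue;

            string raw = field.StringValue();   // stays a string until converted
            object value = raw;
            if (entry.Value == typeof(bool))
                value = raw == "1";             // string comparison by value
            else if (entry.Value == typeof(DateTime))
                value = DateTools.StringToDate(raw);

            SetValue(itemInstance, entry.Key, value);
        }
        return itemInstance;
    }

    // Stand-in for the question's pF.SetValue; handles only string, bool
    // and DateTime properties, as in the original code.
    private static void SetValue(T item, string name, object value)
    {
        PropertyInfo property = typeof(T).GetProperty(name);
        if (property != null) property.SetValue(item, value, null);
    }
}
```

Because the string is now held in a string variable, the comparison with "1" compares contents, whereas comparing an object variable with a string literal using == compares references.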
On the topic of boxing, we’re working on a branch of Lucene.Net that you can pull from SVN. It changes the internals from boxing containers (ArrayLists, non-generic Lists and HashTables) to generics and more .NET-friendly constructs. This is the 2.9.4g branch: .Net’ified, as we like to say. We haven’t officially benchmarked it, but developer tests have shown it, in some cases, to be around 200% faster than older versions.
The other thing to keep in mind is that Lucene is great as a search engine, but you may find that in some cases it does not stack up to MySQL. Really, though, the only way to know for sure is to test and look for performance bottlenecks like the ones I mentioned above.
Hope that helps! Also, don’t forget about the Lucene.Net mailing list (lucene-net-dev@lucene.apache.org) if you have any questions. The other committers and I are generally quick to answer.