I have a Lucene.net index with 10 fields, some stored and some indexed, holding 460 million documents. The index is about 250GB. I’m using Lucene.net 3.0.3, and every search easily eats up 2GB+ of RAM, which causes my 32-bit application to throw out-of-memory exceptions. Unfortunately, I cannot run the app as a 64-bit process because of other 32-bit dependencies.
As far as I know I’m following Lucene best practices:
- One open index writer that writes documents in batches
- A shared reader that doesn’t close and reopen itself across searches
- The index searcher has a termInfosIndexDivisor set to 4, which didn’t seem to make a difference. I even tried setting it to something huge like 1000 but didn’t notice any memory changes.
- Fields that do not need to be subsearched aren’t analyzed (i.e. full string searching only) and fields that don’t need to be retrieved back from the search aren’t stored.
- I’m using the default StandardAnalyzer for both indexing and searching.
- If I prune the data and make a smaller index, then things do work. When I have an index that is around 50GB in size I can search it with only about 600MB of RAM.
However, I do have a sort applied on one of the fields, but even without the sort the memory usage is huge for any search. I don’t particularly care about document score, more that the document exists in my index, but I’m not sure if somehow ignoring the score calculation will help with the memory usage.
I recently upgraded from Lucene.net 2.9.4 to Lucene.net 3.0.3 thinking that that might help, but the memory usage looks about the same between the two versions.
Frankly, I’m not sure whether this index is just too large for a single machine to feasibly search. Most examples I find talk about indexes of 20-30GB or less, so maybe this isn’t possible, but I wanted to at least ask.
If anyone has any suggestions on what I can do to make this useable that would be great. I am willing to sacrifice search speed for memory usage if possible.
You CAN run the app in 64 bit: make a separate 64-bit process for the Lucene part and use remoting (or WCF) to communicate with it. Finished. That is the standard approach.
You’re already thinking about splitting it, so heck, isolate it and put it on 64 bit.
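A minimal sketch of what that split could look like with WCF over a named pipe. Everything named here (ISearchService, SearchService, SearchHost, SearchClient, the pipe address, and returning string IDs) is made up for illustration and not taken from the question. The actual Lucene.net query is left as a placeholder.

```csharp
using System;
using System.ServiceModel;

// Hypothetical contract, shared by the 64-bit host and the 32-bit client.
[ServiceContract]
public interface ISearchService
{
    // Returns identifiers of up to maxHits matching documents.
    [OperationContract]
    string[] Search(string queryText, int maxHits);
}

// Lives only in the 64-bit process, the one place where Lucene.net is loaded.
[ServiceBehavior(InstanceContextMode = InstanceContextMode.Single)]
public class SearchService : ISearchService
{
    public string[] Search(string queryText, int maxHits)
    {
        // Placeholder: run the query against the shared IndexSearcher here
        // and return only the values the caller actually needs.
        return new string[0];
    }
}

public static class SearchHost
{
    // Assumed address; any unused pipe name works.
    public const string Address = "net.pipe://localhost/lucene-search";

    // Entry point of the separate executable, compiled for x64.
    public static void Main()
    {
        using (var host = new ServiceHost(new SearchService()))
        {
            host.AddServiceEndpoint(typeof(ISearchService), new NetNamedPipeBinding(), Address);
            host.Open();
            Console.WriteLine("Search host running. Press Enter to stop.");
            Console.ReadLine();
        }
    }
}

// Called from the existing 32-bit application, which references only the contract.
public static class SearchClient
{
    public static string[] Query(string queryText, int maxHits)
    {
        var factory = new ChannelFactory<ISearchService>(
            new NetNamedPipeBinding(), new EndpointAddress(SearchHost.Address));
        ISearchService proxy = factory.CreateChannel();
        try
        {
            return proxy.Search(queryText, maxHits);
        }
        finally
        {
            // Sketch only: production code should Abort() a faulted channel instead.
            ((IClientChannel)proxy).Close();
            factory.Close();
        }
    }
}
```

The 32-bit app keeps its 32-bit dependencies and calls something like SearchClient.Query("some query", 10), while the index memory lives entirely in the 64-bit host. Named pipes only work on the same machine, and the binding's default message size limit is 64KB, so large result sets would need a higher limit or paging.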