I do not want to use Lucene as a database, per se, but rather as the primary look-up for displaying lists to the user in an ASP.Net web application we are building. This would be a canned search to Lucene where we would pull, say, all user information to be displayed in a grid list. Is it a good idea to pull a list of items (that can be paged) from Lucene initially, to display to the user in some sort of grid format? The only time we would call the database is when a user selects a specific record to view or update.
My concern is stale data coming from Lucene. I have been looking for information about adds and updates to an index, but it is unclear to me whether my scenario is better suited to a database than to Lucene. My other developers and I have been going back and forth about this, but unfortunately, we don’t know enough about how Lucene handles writes and reads.
I’m not sure if it’s a good or bad fit for your use case. Hopefully I can give you some insight on how Lucene stores its data, and you can make a decision from that.
Lucene is extremely quick if you want to search for an item in its index. Indexing the items isn’t as quick. It’s by no means slow when you look at everything it’s doing, but it adds complexity that you have to plan for.
Lucene is essentially a document store. Each item in Lucene is a Document, which can hold a number of fields. Those fields are essentially key-value pairs, though right now Lucene only supports `string` and `byte[]` as values, and strings only as keys. Each field can be indexed and/or analyzed (or neither). Indexing simply means you can search against that field’s data, generally only via exact matches and wildcards. Analyzing gives you better searching capabilities, since it takes the string and tokenizes it. Different analyzers tokenize differently. The most common approach splits on whitespace and drops stopwords, essentially marking each word as a term unless it’s something like a, an, the, as, etc.
The real killer for many use cases is that you can’t update a document in an index in place. When you pull out a document to update it and change a field, the call to `UpdateDocument()` actually marks the old document as deleted and inserts a new document.
Notice I said it marks it as deleted. That introduces another thing related to Lucene indexes: optimization of the index. When you write to an index, every so often a segment of the index is written to disk. (It’s temporarily stored in RAM for fast indexing.) When you run a search on an index, Lucene needs to open all those different segments to find the terms to search against (and it has to order them, too). This means that if you have many segments, searching can be slow. A call to `Optimize()` will not only merge the segments together, it will also remove any documents marked for deletion, thus lowering your index size as well.
However, optimizing your index requires around 1.5x more space while the optimization is being done, sometimes more. Fortunately, Lucene.Net is transactional during an optimization, which means not only will your index not be corrupted if an optimization fails, but any existing `IndexReader` you have open will still be able to search and read from the index while you’re optimizing it.
In short, if it were me, and you were expecting to get only one result from each search, I might not recommend Lucene. Lucene especially shines when you’re searching through many documents for many documents. It’s an inverted index and it’s good at that. For a single lookup, you may be better off with a database. Unfortunately, the only way you’ll really find out is to benchmark it. Fortunately, Lucene.Net at least is very easy to set up for something like that.
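To make the "update is really delete plus add" and "optimize merges segments" behaviour a bit more concrete, here is a small toy model in Python. It is not Lucene.Net code and does not use any Lucene API: the `ToyIndex` class, its method names, and the stopword list are all made up for illustration, and the search is a plain linear scan rather than a real inverted index. It only sketches the bookkeeping described above.

```python
# Toy model only: every name here is hypothetical, nothing is a Lucene API.
STOPWORDS = {"a", "an", "the", "as"}


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


class ToyIndex:
    def __init__(self):
        self.segments = [[]]   # list of segments; the last one is still open
        self.deleted = set()   # (segment number, position) marked as deleted

    def add_document(self, doc):
        self.segments[-1].append(doc)

    def flush(self):
        """Seal the open segment and start a new one."""
        if self.segments[-1]:
            self.segments.append([])

    def update_document(self, key, value, new_doc):
        """Mark matching documents as deleted, then append the new one."""
        for s, seg in enumerate(self.segments):
            for i, doc in enumerate(seg):
                if doc.get(key) == value:
                    self.deleted.add((s, i))
        self.add_document(new_doc)

    def search(self, field, term):
        """Linear scan over every segment, skipping deleted documents."""
        hits = []
        for s, seg in enumerate(self.segments):
            for i, doc in enumerate(seg):
                if (s, i) in self.deleted:
                    continue
                if term in analyze(doc.get(field, "")):
                    hits.append(doc)
        return hits

    def stored_count(self):
        """Documents physically stored, including ones marked deleted."""
        return sum(len(seg) for seg in self.segments)

    def optimize(self):
        """Merge all segments into one and drop the deleted documents."""
        live = [doc
                for s, seg in enumerate(self.segments)
                for i, doc in enumerate(seg)
                if (s, i) not in self.deleted]
        self.segments = [live]
        self.deleted = set()


index = ToyIndex()
index.add_document({"id": "1", "name": "Ann Smith"})
index.flush()
index.add_document({"id": "2", "name": "Bob the Builder"})
index.flush()
index.update_document("id", "1", {"id": "1", "name": "Ann Jones"})

before = (len(index.segments), index.stored_count())  # old copy still stored
index.optimize()
after = (len(index.segments), index.stored_count())   # merged and purged
print(before, after)
```

In this toy, the old "Ann Smith" record stays in its segment after the update and is only skipped at search time; it goes away physically when `optimize()` rewrites everything into a single segment. That is the same trade-off to weigh for a frequently updated grid list: updates are cheap to record, but the cleanup cost is deferred.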
Also, if you do use Lucene.Net, consider our 2.9.4g branch. You may not be able to use it, since it is technically not release code, but it is a bit faster than the standard Lucene.Net, as we’ve added generics and removed some of the costly boxing done in previous versions.