My application uses Lucene.NET to index various text files. Since each text file is different in structure, the entire content of each file is stored in a single “content” field.
Some of the text files contain URLs, e.g.:
http://domain1.co.uk/blah
http://domain2.co.ru/blahblah
etc.
The code I use to index each file is:
Lucene.Net.Documents.Field fldContent = new Lucene.Net.Documents.Field("content", contents, Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.TOKENIZED, Lucene.Net.Documents.Field.TermVector.YES);
Where “contents” is the file contents.
When querying the index, Lucene returns results only when I search for the exact domain name (e.g. domain1.co.uk); nothing is returned for a partial domain name (e.g. domain1.co).
The code used to build the query is:
Lucene.Net.Index.Term searchTerm = new Lucene.Net.Index.Term("content", "domain1.co");
Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);
Do you have any idea why I must search using the exact domain name?
The StandardAnalyzer/Tokenizer is the culprit here: it does its best to make URLs searchable, but it keeps the whole hostname (domain1.co.uk) as a single term, so a TermQuery for a partial hostname such as domain1.co will not match. The standard approach is to create a custom analyzer/tokenizer; for this I can point you to another SO question with a similar problem and solution.
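If you only need to match the beginning of a hostname, a prefix query may be enough without touching the analyzer. This is a minimal sketch, not the custom-analyzer approach mentioned above. It assumes the host was indexed as one lowercased term (which the behaviour in the question suggests), and the class and method names (PartialHostQuery, Build) are made up for illustration.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class PartialHostQuery
{
    // Hypothetical helper: builds a query that matches any term in the
    // "content" field starting with hostPrefix, e.g. "domain1.co" would
    // match an indexed term "domain1.co.uk".
    public static Query Build(string hostPrefix)
    {
        // PrefixQuery text is not run through the analyzer, so lowercase it
        // here to line up with the lowercased terms StandardAnalyzer produces
        // (an assumption about how the index was built).
        Term prefixTerm = new Term("content", hostPrefix.ToLowerInvariant());
        return new PrefixQuery(prefixTerm);
    }
}
```

You would pass PartialHostQuery.Build("domain1.co") to the searcher in place of the TermQuery. Note that this only covers prefixes: a fragment from the middle of the hostname (e.g. co.uk) would still need a custom analyzer/tokenizer that splits hostnames into separate terms.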