I’m indexing and searching code using a custom analyzer. Given the text “will wi-fi work”, the following tokens are generated (“will”, being a stop word, is eliminated):
wi-fi {position:2 start:5 end:10}
wifi {position:2 start:5 end:10}
wi {position:2 start:5 end:7}
fi {position:2 start:8 end:10}
work {position:3 start:11 end:15}
When I search for the terms wi-fi or work, I get search results. However, when I issue any query (phrase or non-phrase) for wifi, wi, or fi, I don’t get any results. Is there anything wrong with the generated tokens?
Parsed search queries:
For wi-fi (works fine)
Lucene's: +matchAllDocs:true +(alltext:wi-fi alltext:wifi alltext:wi alltext:fi)
For wifi (no results returned)
Lucene's: +matchAllDocs:true +alltext:wifi
For “will wi-fi work” (works fine)
Lucene's: +matchAllDocs:true +alltext:"(wi-fi wifi wi fi) work"
For “will wifi work” (no results returned)
Lucene's: +matchAllDocs:true +alltext:"? wifi work"
UPDATE
Found the issue:
public boolean incrementToken() throws IOException
{
    /*
     * first return all tokens in the list
     */
    if (tokens.size() > 0)
    {
        Token top = tokens.removeFirst();
        restoreState(current);
        **termAtt.setEmpty().append(new String(top.buffer(), 0, top.length()));**
        offsetAtt.setOffset(top.startOffset(), top.endOffset());
        posIncrAtt.setPositionIncrement(0);
        return true;
    }
    /*
     * if there are no more incoming tokens return false
     */
    if (!input.incrementToken())
        return false;
    Token wrapper = new Token();
    wrapper.copyBuffer(termAtt.buffer(), 0, termAtt.length());
    wrapper.setStartOffset(offsetAtt.startOffset());
    wrapper.setEndOffset(offsetAtt.endOffset());
    wrapper.setPositionIncrement(posIncrAtt.getPositionIncrement());
    normalizeHyphens(wrapper);
    current = captureState();
    return true;
}
In the bolded line above, I originally had:
termAtt.setEmpty().append(new String(top.buffer()));
When I searched for wi, I got no results, although wi* did return results. It looks like top.buffer() returns the whole backing array, which can hold extra characters beyond top.length(); those extra characters ended up in the indexed term and caused the weird behavior.
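To illustrate why an exact match on wi failed while wi* still matched, here is a minimal sketch in plain Java with no Lucene dependency. The class BufferJunkDemo and its methods are hypothetical stand-ins, not Lucene API; the only assumption is that a term buffer is a reused char[] that can be longer than the term it currently holds.

```java
public class BufferJunkDemo {

    // Hypothetical helper: copy a term into the start of a reused buffer
    // and return the number of valid chars. Anything already stored past
    // that length is left untouched, as in a reused term buffer.
    static int fill(char[] buf, String term) {
        term.getChars(0, term.length(), buf, 0);
        return term.length();
    }

    // Buggy variant: converts the whole backing array, junk included.
    static String wholeBuffer(char[] buf) {
        return new String(buf);
    }

    // Fixed variant: converts only the valid prefix of the array.
    static String validPrefix(char[] buf, int len) {
        return new String(buf, 0, len);
    }

    public static void main(String[] args) {
        char[] buf = new char[16];
        fill(buf, "wi-fi");           // earlier term leaves chars behind
        int len = fill(buf, "wi");    // the current term is only 2 chars

        String buggy = wholeBuffer(buf);
        String fixed = validPrefix(buf, len);

        System.out.println(buggy.equals("wi"));     // exact match on the buggy term
        System.out.println(buggy.startsWith("wi")); // prefix match, like wi*
        System.out.println(fixed.equals("wi"));     // exact match on the fixed term
    }
}
```

In the filter itself, termAtt.copyBuffer(top.buffer(), 0, top.length()) should have the same effect as the fixed line without building an intermediate String.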
wasted a day on this 🙁
Just guessing without knowing your analyser or parser.