Suppose I have a document collection that I have indexed in Lucene. I submit a query and get hits. Now what I want is to find where in a particular document hit(s) occur(s). I know that I can use the Lucene Highlighting classes to obtain relevant fragments. But how can I find out where exactly these fragments appear in the original contents?
A related question is how to make sure the found fragments are actually very close to the original query? I noticed in my experiments with highlighting that often I would have a multi-word query and it would return fragments that would have only some of these words. But what if I want to make sure I get hits with all the words?
Thanks!
Not an actual answer, just a few links to a solution to a similar problem.
First of all, here you can see the actual results of the highlighting (note that
wereis highlighted thoughamwas in the query. Stemming is an additional feature of this implementation):http://hunglish.hu/search?huSentence=&enSentence=I%20am%20highlighted&size=20&page=2&doc.genre=-10
Here’s the source. Look for these methods:
highlightField,highlightBisenhttp://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/Searcher.java
Disclaimer: I wrote this a while ago, it is not very nice code, and it is buggy in special cases: there is an open issue relating to highlighting. Furthermore, it uses version 3.2.0 of the lucene-highlighter, which is possibly not the newest.
Anyway, I hope if you look at how it works, it helps you write a better one, or at least something that works as expected.